Finding a CUR decomposition

ABSTRACT

One embodiment is a computer-implemented method for finding a CUR decomposition. The method includes constructing, by a computer processor, a matrix C based on a matrix A. A matrix R is constructed based on the matrix A and the matrix C. A matrix U is constructed based on the matrices A, C, and R. The matrices C, U, and R provide a CUR decomposition of the matrix A. The construction of the matrices C, U, and R provides at least one of an input-sparsity-time CUR and a deterministic CUR.

BACKGROUND

Various embodiments of this disclosure relate to CUR decompositions and, more particularly, to improved techniques for finding a CUR decomposition.

CUR decompositions are often applied to the field of data compression. If data is in the form of an m×n matrix A, then storage space of O(mn) is required to store this data. CUR decompositions compress the data by removing some redundancy and reducing the rank of A, hence requiring reduced storage.

Given as inputs a matrix A∈ℝ^(m×n) and integers c<n and r<m, the CUR decomposition, or factorization, of A finds C∈ℝ^(m×c) with c columns of A, R∈ℝ^(r×n) with r rows of A, and U∈ℝ^(c×r), such that A=CUR+E. The value of E=A−CUR represents the residual error matrix.

Generally, where matrix A represents some data set, the CUR decomposition of A (i.e., C*U*R) approximates A. For example, a CUR decomposition can be used to compress the data in A by removing some of the redundancy in A. From an algorithmic perspective, it is a challenge to construct C, U, and R quickly and in such a way as to minimize the approximation error ∥A−CUR∥_F².

More precisely, the “CUR Problem,” as it will be referred to herein, is as follows: Given A∈ℝ^(m×n) of rank ρ=rank(A), rank parameter k<ρ, and accuracy parameter 0<ε<1, construct C∈ℝ^(m×c) with c columns from A, R∈ℝ^(r×n) with r rows from A, and U∈ℝ^(c×r), with c, r, and rank(U) being as small as possible, in order to reconstruct A within relative error:

∥A−CUR∥_F² ≦ (1+ε)∥A−A_k∥_F².
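The relative-error criterion above can be evaluated directly for any candidate factors. The following is a minimal numpy sketch of such a check; the function name and the use of a dense SVD to obtain A_k are illustrative assumptions, not part of the embodiments described herein.

```python
import numpy as np

def relative_error_holds(A, C, U, R, k, eps):
    """Check the CUR Problem criterion: ||A - C U R||_F^2 <= (1 + eps) * ||A - A_k||_F^2."""
    u, s, vt = np.linalg.svd(A, full_matrices=False)
    A_k = (u[:, :k] * s[:k]) @ vt[:k, :]          # best rank-k approximation of A
    cur_err = np.linalg.norm(A - C @ U @ R, 'fro') ** 2
    opt_err = np.linalg.norm(A - A_k, 'fro') ** 2
    return cur_err <= (1.0 + eps) * opt_err
```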

In contrast to the above is the singular value decomposition (SVD) factorization, where k<rank(A) and A≈U_kΣ_kV_k^T. The SVD residual error A−A_k is the best possible, under certain rank constraints. The matrices U_k∈ℝ^(m×k) and V_k∈ℝ^(n×k) contain the top k left and right singular vectors of A, while Σ_k∈ℝ^(k×k) contains the top k largest singular values of A. In CUR, C and R contain actual columns and rows of A, a property desirable for feature selection and data interpretation when using CUR as an approximation of A. Because C and R are actual columns and rows of A, sparsity in the input matrix A is also preserved in CUR. Thus, CUR is attractive in a wide range of applications.

SUMMARY

In one embodiment of this disclosure, a computer-implemented method includes constructing, by a computer processor, a matrix C based on a matrix A. A matrix R is constructed based on the matrix A and the matrix C. A matrix U is constructed based on the matrices A, C, and R. The matrices C, U, and R provide a CUR decomposition of the matrix A. The construction of the matrices C, U, and R provides at least one of an input-sparsity-time CUR and a deterministic CUR.

In another embodiment, a system includes a first construction unit, a second construction unit, and a third construction unit. The first construction unit is configured to construct, by a computer processor, a matrix C based on a matrix A. The second construction unit is configured to construct a matrix R based on the matrix A and the matrix C. The third construction unit is configured to construct a matrix U based on the matrices A, C, and R. The matrices C, U, and R provide a CUR decomposition of the matrix A. The construction of the matrices C, U, and R provides at least one of an input-sparsity-time CUR and a deterministic CUR.

In yet another embodiment, a computer program product for finding a CUR decomposition includes a computer readable storage medium having program instructions embodied therewith. The program instructions are executable by a processor to cause the processor to perform a method. The method includes constructing a matrix C based on a matrix A. Further according to the method, a matrix R is constructed based on the matrix A and the matrix C. A matrix U is constructed based on the matrices A, C, and R. The matrices C, U, and R provide a CUR decomposition of the matrix A. The construction of the matrices C, U, and R provides at least one of an input-sparsity-time CUR and a deterministic CUR.

Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with the advantages and the features, refer to the description and to the drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a block diagram of a decomposition system, according to some embodiments of this disclosure;

FIG. 2 is a flow diagram of a method for finding a CUR decomposition of a matrix A, according to some embodiments of this disclosure;

FIG. 3 is a flow diagram of a method for finding a CUR decomposition of a matrix A in input-sparsity time, according to some embodiments of this disclosure;

FIG. 4 is a flow diagram of a third method for finding a deterministic CUR decomposition of a matrix A, according to some embodiments of this disclosure; and

FIG. 5 is a block diagram of a computing device for implementing some or all aspects of the decomposition system, according to some embodiments of this disclosure.

DETAILED DESCRIPTION

Various embodiments of decomposition systems and methods herein provide improved techniques for finding CUR decompositions. When used for compression, embodiments of the decomposition system may compress the data by removing redundancy and retaining only necessary information, hence requiring less storage. Some embodiments may retain actual data from the input matrix A, which can offer an advantage over existing approaches in terms of interpretability. Further, as compared to CUR techniques used previously, some embodiments may perform faster and may provide a shorter output description than alternatives.

Despite the significant amount of work and progress on CUR, at least the following questions have remained unanswered, given an accuracy parameter 0<ε<1: (1) Optimal CUR: Are there any (1+ε)-error CUR algorithms selecting the optimal number of columns and rows? (2) Rank-k CUR: Are there any (1+ε)-error CUR algorithms constructing a U matrix with optimal rank? (3) Input-sparsity-time CUR: Are there any input-sparsity-time (1+ε)-error algorithms for CUR? (4) Deterministic CUR: Are there any deterministic, polynomial-time (1+ε)-error CUR algorithms? The present decomposition systems and methods may address these open problems related to CUR decompositions.

The decomposition system may be based on technical contributions to the art provided in this disclosure, answering the open problems related to optimal CUR, rank-k CUR, input-sparsity-time CUR, and deterministic CUR.

One such technical contribution is an input-sparsity-time, relative-error CUR algorithm selecting an optimal number of columns and rows and constructing a U matrix with optimal rank.

It can be shown that, given A∈ℝ^(m×n) of rank ρ, a target rank 1≦k<ρ, and 0<ε<2, there exists a randomized algorithm to select at most c=O(k/ε) columns and at most r=O(k/ε) rows from A, to form matrices C∈ℝ^(m×c), R∈ℝ^(r×n), and U∈ℝ^(c×r) with rank(U)=k such that, with constant probability of success, ∥A−CUR∥_F² ≦ (1+O(ε))∥A−A_k∥_F².

Further, the matrices C, U, and R can be computed in time O(nnz(A)log n+(m+n)·poly(log n, k, 1/ε)).

Using such an algorithm, the present decomposition system may be capable of achieving a relative-error CUR decomposition with O(k/ε) rows and columns, which is a task that was not achieved through conventional techniques. This number of rows and columns is asymptotically optimal.

A second technical contribution on which the present decomposition system may be based is a deterministic, relative-error CUR algorithm that runs in time O(mn³k/ε) (i.e., polynomial time). With this algorithm, the decomposition system may construct C with c=O(k/ε) columns, R with r=O(k/ε) rows, and U of rank k. Existing work provides a relative-error, deterministic algorithm based on volume sampling, constructing C with O(k/ε) columns of A such that ∥A−CC^†A∥_F² ≦ (1+ε)∥A−A_k∥_F². However, it is not obvious how to extend this to a rank-k, column-based, relative-error decomposition, which is a harder instance of the problem. The best existing deterministic algorithm for a rank-k, column-based decomposition achieves only a (2+ε)-error.

FIG. 1 is a block diagram of a decomposition system 100, according to some embodiments of this disclosure. As shown, the decomposition system 100 may include a C-construction unit 110, an R-construction unit 120, and a U-construction unit 130. In general, the C-construction unit 110 may construct matrix C of a CUR decomposition of matrix A; the R-construction unit 120 may construct matrix R of the CUR decomposition; and the U-construction unit 130 may construct matrix U of the CUR decomposition based on C and R. Each of these units 110, 120, 130 may be implemented in hardware, software, or a combination thereof. The construction of each of these matrices will be described further below.

Following is an overview of various techniques used by embodiments of the decomposition system 100. A useful starting point of the present decomposition system 100 is the following tool from prior work, which connects matrix factorizations and column subset selection.

The lemma immediately below has been shown in Boutsidis (C. Boutsidis, P. Drineas, and M. Magdon-Ismail; Near-optimal column-based matrix reconstruction; SIAM Journal on Computing (SICOMP), 2013):

The following is referred to herein as Lemma 1: Let A=AZZ^T+E∈ℝ^(m×n) be a low-rank matrix factorization of A, with Z∈ℝ^(n×k) and Z^TZ=I_k. Let S∈ℝ^(n×c) (c≧k) be any matrix such that rank(Z^TS)=rank(Z)=k. Let C=AS∈ℝ^(m×c). Then ∥A−CC^†A∥_F² ≦ ∥A−Π_{C,k}^F(A)∥_F² ≦ ∥E∥_F²+∥ES(Z^TS)^†∥_F².

Here, Π_{C,k}^F(A)=CX_opt∈ℝ^(m×n), where X_opt∈ℝ^(c×n) has rank at most k and CX_opt is the best rank-k approximation to A in the column span of C.

If S samples columns from A (i.e., C=AS consists of original columns of A), then, using the above lemma, one obtains a bound for a column-based, low-rank matrix approximation of A. This bound depends on the specific choice of Z.
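As an illustration of how Lemma 1 is used, the inequality can be checked numerically on a small instance. The sketch below is only a sanity check under the stated assumptions (Z orthonormal, rank(Z^TS)=k); the helper name and the random test instance are illustrative.

```python
import numpy as np

def lemma1_bound_holds(A, Z, S, tol=1e-8):
    """Check ||A - C C^+ A||_F^2 <= ||E||_F^2 + ||E S (Z^T S)^+||_F^2 for C = A S, E = A - A Z Z^T."""
    E = A - A @ Z @ Z.T
    C = A @ S
    lhs = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro') ** 2
    rhs = (np.linalg.norm(E, 'fro') ** 2
           + np.linalg.norm(E @ S @ np.linalg.pinv(Z.T @ S), 'fro') ** 2)
    return lhs <= rhs + tol

# Example: Z = top-k right singular vectors of A, S = a random column selector.
m, n, k, c = 60, 40, 3, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) + 0.1 * rng.standard_normal((m, n))
Z = np.linalg.svd(A, full_matrices=False)[2][:k, :].T
S = np.eye(n)[:, rng.choice(n, size=c, replace=False)]
print(lemma1_bound_holds(A, Z, S))
```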

Some embodiments of the decomposition system obtain an optimal, relative-error, rank-k CUR by using this lemma twice and adaptively, one time to sample columns from A and the other time to sample rows. This technique is "adaptive" in that the two sampling processes are not independent. Rather, the rows sampled in the second application of the lemma depend on the columns sampled in the first application. An overview of this approach follows.

Assume that, for an appropriate matrix Z₁∈ℝ^(n×k), one can find S₁∈ℝ^(n×c₁) that samples c₁=O(k) columns from A such that applying Lemma 1 with A and Z₁ gives (with C₁=AS₁ and E₁=A−AZ₁Z₁^T):

∥A−C₁C₁^†A∥_F² ≦ ∥E₁∥_F² + ∥E₁S₁(Z₁^TS₁)^†∥_F² ≦ O(1)∥A−AZ₁Z₁^T∥_F² ≦ O(1)∥A−A_k∥_F²,

where in the third inequality, it is further assumed that

∥E₁∥_F² = ∥A−AZ₁Z₁^T∥_F² ≦ O(1)∥A−A_k∥_F².

In other words, Z₁ may approximate, within a constant factor, the best rank-k SVD of A. One choice for Z₁ is the matrix V_k from the rank-k SVD of A. However, since this choice is costly, some embodiments herein use methods that approximate the SVD. The bound in the second inequality also requires that the term ∥E₁S₁(Z₁^TS₁)^†∥_F² is sufficiently small, specifically ∥E₁S₁(Z₁^TS₁)^†∥_F² ≦ O(1)∥E₁∥_F². This can be achieved by choosing the sampling matrix S₁ carefully, which is further discussed below.

Once the matrix C₁ with O(k) columns of A is obtained, which approximates the best rank-k matrix within a constant factor, an adaptive sampling method, e.g., the adaptive sampling method of Deshpande (A. Deshpande, L. Rademacher, S. Vempala, and G. Wang; Matrix approximation and projective clustering via volume sampling; in Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1117-1126, 2006), may be used to sample O(k/ε) additional columns to obtain a matrix C with c=O(k)+O(k/ε) columns for which ∥A−CC^†A∥_F² ≦ ∥A−CX_opt∥_F² ≦ (1+ε)∥A−A_k∥_F².

Herein, the above formula is referred to as Equation 1.

The adaptive sampling may convert a constant-factor approximation to a relative-error one, by sampling an additional O(k/ε) columns. Matrix X_opt∈ℝ^(c×n) has rank at most k. Matrix CX_opt is the best rank-k approximation of A in the column span of C, and C would be the matrix with columns of A in the CUR decomposition that is sought.
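A minimal sketch of the adaptive sampling step in the style of Deshpande follows: additional columns are drawn with probabilities proportional to the squared column norms of the residual A−C₁C₁^†A. The function name and interface are illustrative; the input-sparsity-time and deterministic variants described later replace the exact residual computation.

```python
import numpy as np

def adaptive_sample_columns(A, C1, c2, rng=np.random.default_rng()):
    """Sample c2 further columns of A with probabilities proportional to the
    squared column norms of the residual A - C1 C1^+ A, and append them to C1."""
    resid = A - C1 @ np.linalg.pinv(C1) @ A
    probs = np.sum(resid ** 2, axis=0)
    probs = probs / probs.sum()
    idx = rng.choice(A.shape[1], size=c2, replace=True, p=probs)
    return np.concatenate([C1, A[:, idx]], axis=1)    # C = [C1, C2]
```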

Lemma 1 may be used again to sample rows from A, i.e., by applying the lemma to A^T. If matrix C had orthonormal columns, then it could immediately play the role of Z in Lemma 1. In that case, E=A^T−A^TCC^T. However, matrix C is not orthonormal. An orthonormal matrix with a similar bound as in Equation 1 is sought. A possible choice is to take Z to be an orthonormal basis of C. However, this choice is not ideal because, in that case, Z would have dimensions m×c. Hence, to satisfy the rank assumption in Lemma 1, one would need to sample at least c rows from Z∈ℝ^(m×c). Recall that c=O(k/ε), and therefore O(k/ε²) rows would need to be sampled in this operation. However, it is desired to obtain a CUR decomposition with only O(k/ε) rows. To achieve that, O(k/ε) rows is too many to select at this point. Rather, some embodiments may select O(k) rows, because otherwise the adaptive sampling operation that follows, which may be unavoidable, would sample O(k/ε²) rows from A. What is sought is therefore some matrix Z in the column span of C such that ZZ^TA approximates A as well as CX_opt, and at the same time the number of columns of Z is O(k). To this end, a matrix Z₂ is constructed, where Z₂∈ℝ^(m×k) and Z₂∈span(C), with ∥A^T−A^TZ₂Z₂^T∥_F² ≦ (1+O(ε))∥A−CX_opt∥_F².

The above is referred to herein as Equation 2.
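One standard way to realize a Z₂ satisfying Equation 2 is to project A onto an orthonormal basis of C and keep the top k left singular directions of the projected matrix; the embodiments above may instead use a sketched projection (AW) for speed. The sketch below is illustrative only.

```python
import numpy as np

def best_rank_k_in_span(A, C, k):
    """Return Z2 with k orthonormal columns in span(C) such that Z2 Z2^T A is the
    best rank-k approximation of A whose columns lie in the column span of C."""
    Q, _ = np.linalg.qr(C)                       # orthonormal basis of span(C)
    U, _, _ = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U[:, :k]                          # Z2 in span(C), Z2^T Z2 = I_k
```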

It is sought to apply Lemma 1 to A^T and Z₂∈ℝ^(m×k). It is assumed that S₂∈ℝ^(m×r₁) (i.e., R₁=(A^TS₂)^T∈ℝ^(r₁×n)) can be found with r₁=O(k) rows, such that (with E₂=A^T−A^TZ₂Z₂^T):

∥A−AR₁^†R₁∥_F² ≦ ∥E₂∥_F² + ∥E₂S₂(Z₂^TS₂)^†∥_F² = ∥A−Z₂Z₂^TA∥_F² + ∥(A−Z₂Z₂^TA)^TS₂(Z₂^TS₂)^†∥_F² ≦ O(1)∥A−Z₂Z₂^TA∥_F².

The above is referred to herein as Equation 3.

The last inequality in the above requires that ∥E₂S₂(Z₂^TS₂)^†∥_F² = O(1)∥E₂∥_F², which is possible after constructing the matrix S₂ appropriately.

At this point, a C and a subset of rows R₁ of the optimal CUR have been found. Two additional operations are desired: (1) The adaptive sampling method for CUR that appeared in Wang (S. Wang and Z. Zhang; Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling; Journal of Machine Learning Research, 2013) may be used to further sample another r₂=O(k/ε) rows beyond those in R₁, i.e., construct R∈ℝ^(r×n) with r=O(k+k/ε) rows. And (2) for Z₂Z₂^TAR^†R, replace Z₂ with CΘ, for an appropriate Θ, and take U=ΘZ₂^TAR^†∈ℝ^(c×r). Such a Θ always exists because Z₂ is in the span of C. So overall,

CUR=Z₂Z₂^TAR^†R,

and the bound is as follows:

∥A−Z₂Z₂^TAR^†R∥_F² ≦ ∥A−Z₂Z₂^TA∥_F² + (rank(Z₂Z₂^TA)/r₂)·∥A−AR₁^†R₁∥_F² = ∥A−Z₂Z₂^TA∥_F² + O(ε)·∥A−AR₁^†R₁∥_F² ≦ (1+O(ε))∥A−Z₂Z₂^TA∥_F² ≦ (1+O(ε))∥A−CX_opt∥_F² ≦ (1+O(ε))∥A−A_k∥_F².

The first inequality in the group directly above is from the adaptive sampling argument. The equality is from a selected choice of r₂. The second inequality is from Equation 3. The third inequality is from Equation 2. The last inequality is from Equation 1.

From all the above derivations, four sufficient conditions that give an optimal, relative-error, rank-k CUR may be identified. These conditions are referred to herein as primitives. By designing matrices C, U, and R satisfying those basic primitives, an optimal, relative-error, rank-k CUR may be secured, according to some embodiments.

To construct an optimal, relative-error, and rank-k CUR, the four basic primitives that may be relied upon are: (1) There is a C∈ℝ^(m×c), X_opt∈ℝ^(c×n) with a relative-error bound to A−A_k and c=O(k/ε), as provided in Equation 1. (2) There is a Z₂∈ℝ^(m×k) (Z₂∈span(C) and Z₂^TZ₂=I_k) with a relative-error bound to A−CX_opt, as provided in Equation 2. (3) There is R₁∈ℝ^(r₁×n) with a constant-factor error to A−Z₂Z₂^TA and r₁=O(k), as provided in Equation 3. And (4) there is an adaptive sampling algorithm for CUR.

In some embodiments, a user or administrator of the decomposition system 100 may be enabled to select C, Z₂, and R₁ in the above conditions as desired. Below, various implementation choices that lead to desirable CUR algorithms are further discussed.

To address Primitive 1 above, known ideas for column subset selection, including leverage-scores sampling as in Drineas (P. Drineas, M. W. Mahoney, and S. Muthukrishnan; Relative-error CUR matrix decompositions; SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008); BSS sampling, i.e., deterministic sampling similar to the method of Batson (J. Batson, D. Spielman, and N. Srivastava; Twice-Ramanujan sparsifiers; in Proceedings of the 41st Annual ACM Symposium on Theory of Computing, pages 255-262; ACM, 2009); and adaptive sampling, as in Deshpande, may be combined. To find Z₁, techniques for approximating the SVD may be used. To address Primitive 2, methods to find low-rank approximations within a subspace may be used. To address Primitive 3, leverage-score sampling and BSS sampling may again be used, as in Primitive 1. To address Primitive 4, an adaptive sampling method, such as that of Wang, may be used to turn a constant-factor CUR to relative error by sampling an additional O(k/ε) rows. These various operations are summarized in Algorithm 1, which is referred to herein as a "proto-algorithm" for an optimal, relative-error, rank-k CUR.

Algorithm 1 is a proto-algorithm for an optimal, relative-error, rank-k CUR. Algorithm 1 may be implemented on a computer system in hardware, software, or a combination thereof. Various operations performed in this algorithm follow immediately below:

Input: A∈ℝ^(m×n); rank parameter k<rank(A); accuracy parameter 0<ε<1.

Output: C∈ℝ^(m×c) with c=O(k/ε); R∈ℝ^(r×n) with r=O(k/ε); U∈ℝ^(c×r) with rank(U)=k.

(1) Construct C with O(k+k/ε) columns. This includes the operations of (1a)-(1d).

(1a) Approximate SVD: find Z₁∈ℝ^(n×O(k)) that approximates V_k∈ℝ^(n×k) from the SVD of A.

(1b) Leverage-scores sampling: sample O(k log k) columns from A with probabilities from Z₁.

(1c) BSS sampling: downsample these columns to O(k) columns.

(1d) Adaptive sampling: sample O(k/ε) additional columns.

(2) Construct R with O(k+k/ε) rows. This includes the operations of (2a)-(2d).

(2a) Restricted low-rank approximation: find Z₂∈ℝ^(m×O(k)) with Z₂∈span(C).

(2b) Leverage-scores sampling: sample O(k log k) rows from A with probabilities from Z₂.

(2c) BSS sampling: downsample these rows to O(k) rows.

(2d) Adaptive sampling: sample O(k/ε) additional rows.

(3) Construct U of rank k. This includes the operations of (3a).

(3a) Let Z₂Z₂^TAR^†R; then, replace Z₂ with CΘ, for an appropriate Θ, and use U=ΘZ₂^TAR^†.

FIG. 2 is a flow diagram of a method 200 for finding a CUR decomposition of a matrix A, using Algorithm 1, according to some embodiments of this disclosure. At a high level, matrix C may be constructed with O(k/ε) columns at block 210, matrix R may be constructed with O(k/ε) rows at block 220, and matrix U may be constructed from matrices C and R at block 230.

More specifically, block 210 for constructing C may include blocks 212, 214, 216, and 218. At block 212, Z₁ may be computed as the top k singular vectors of A, such that ∥A−AZ₁Z₁^T∥_F² ≦ (1+ε)∥A−A_k∥_F². At block 214, O(k log k) columns may be sampled from the transpose of Z₁ or, equivalently, from A, with leverage scores, i.e., Euclidean norms of the columns of Z₁ transpose. At block 216, the sampled columns may be down-sampled to C₁∈ℝ^(m×c₁) containing c₁=O(k) columns, such as by using BSS sampling, and such that ∥A−C₁C₁^†A∥_F² ≦ O(1)∥A−AZ₁Z₁^T∥_F². At block 218, c₂=O(k/ε) columns of A may be adaptively sampled, such that ∥A−C·D∥_F² ≦ ∥A−A_k∥_F² + (k/c₂)∥A−C₁C₁^†A∥_F², where D is a rank-k matrix.
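A minimal sketch of the leverage-score sampling of block 214 follows, assuming Z₁ has already been computed; the scores are the squared Euclidean norms of the columns of Z₁^T (equivalently, the squared row norms of Z₁). The function name is illustrative.

```python
import numpy as np

def leverage_score_sample_columns(A, Z1, t, rng=np.random.default_rng()):
    """Sample t column indices of A with probabilities proportional to the
    leverage scores, i.e., the squared Euclidean norms of the columns of Z1^T."""
    scores = np.sum(Z1 ** 2, axis=1)             # squared row norms of Z1 (n of them)
    probs = scores / scores.sum()
    return rng.choice(A.shape[1], size=t, replace=True, p=probs)
```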

Block 220 for constructing R may include blocks 222, 224, 226, and 228. At block 222, Z₂∈ℝ^(m×k) may be found in the span of C, such that ∥A−Z₂Z₂^TA∥_F² ≦ (1+ε)∥A−A_k∥_F². To do this efficiently, instead of projecting columns of A onto C, the decomposition system 100 may project columns of AW onto C, where W is a random subspace embedding. The decomposition system 100 may then find a best rank-k approximation of the columns of AW in C. At block 224, O(k log k) rows may be sampled with leverage scores from Z₂. At block 226, those sampled rows may be down-sampled to R₁∈ℝ^(r₁×n) containing r₁=O(k) rows, such as by BSS sampling, and such that ∥A−AR₁^†R₁∥_F² ≦ O(1)∥A−Z₂Z₂^TA∥_F². At block 228, r₂=O(k/ε) rows of A may be adaptively sampled, such that

∥A−Z₂Z₂^TAR^†R∥_F² ≦ ∥A−Z₂Z₂^TA∥_F² + (rank(Z₂Z₂^TA)/r₂)·∥A−AR₁^†R₁∥_F².

To obtain a deterministic or input-sparsity-time CUR, the decomposition system 100 may implement the corresponding operations of the proto-algorithm, as provided above, in an appropriate setting. This is further discussed below.

For an algorithm that runs in time proportional to the number of non-zero entries of A, one may implement the four primitives in input-sparsity time. Equivalently, all the operations of the proto-algorithm may be implemented in input-sparsity time. It is known how to implement approximate SVD in input-sparsity time. Given the relevant probabilities, the leverage-scores sampling operation can be implemented fast as well. For the remainder of the operations of the proto-algorithm, new tools are needed and used by various embodiments.

An input-sparsity-time version of the BSS sampling operation may be used. To design this, as will be described further below, one combines the method from Boutsidis with ideas from the sparse subspace embedding literature of Clarkson (K. L. Clarkson and D. P. Woodruff; Low rank approximation and regression in input sparsity time; In STOC, 2013). Various embodiments also use input-sparsity-time versions of the adaptive sampling algorithms of Deshpande and Wang. To this end, one combines existing ideas in Deshpande and Wang with the Johnson-Lindenstrauss lemma. An input-sparsity-time algorithm to find a relative-error, low-rank matrix within a given subspace may also be used. To this end, one combines ideas for subspace-restricted matrix approximations with ideas from Kannan (R. Kannan, S. Vempala, and D. P. Woodruff; Nimble algorithms for cloud computing; arXiv preprint arXiv:1304.3162, v3, 2013) and Clarkson. A method, presented below, may be used to compute a rank-k matrix U in input-sparsity time. To determine this method, one may take the original construction of U, which is a series of products of various matrices, and reduce the dimensions of some of those matrices using the sparse subspace embeddings in Clarkson. The computation of U may be viewed as the solution of a generalized matrix approximation problem, to which sketching concepts may be applied.

For a deterministic CUR algorithm, the CUR proto-algorithm may be implemented in a deterministic way. Various operations may be implemented deterministically with tools from conventional techniques. A missing piece from the existing art is a deterministic version of adaptive sampling, which is obtained in this disclosure by derandomizing the adaptive sampling algorithms in Deshpande and Wang.

Some embodiments of the decomposition system 100 of this disclosure implement a randomized CUR algorithm that runs in linear time. In some embodiments, the algorithm constructs C and R with rescaled columns and rows from A. To convert this to a true CUR decomposition, the decomposition system 100 may keep the un-rescaled versions of C and R, and introduce the scaling factors in U. The analysis carries over to that CUR version unchanged. The embodiment below need not provide the fastest possible CUR algorithm, but does provide a benefit in its simplicity. A faster randomized algorithm is presented following this one. The analysis of the algorithm in this section serves as a stepping stone for the input-sparsity-time and deterministic CUR algorithms presented later.

Algorithm 2 is a randomized, linear-time, optimal, relative-error, rank-k CUR that closely follows the CUR proto-algorithm (i.e., Algorithm 1). Algorithm 2 may be implemented on a computer system in hardware, software, or a combination thereof. Various operations performed in this algorithm follow immediately below:

Input: A∈ℝ^(m×n); rank parameter k<rank(A); and accuracy parameter 0<ε<1.

Output: C∈ℝ^(m×c) with c=O(k/ε); R∈ℝ^(r×n) with r=O(k/ε); and U∈ℝ^(c×r), with rank(U)=k.

(1) Construct C with O(k+k/ε) columns. This includes the operations of (1a)-(1d).

(1a) Z₁=RandomizedSVD(A, k, 1); Z₁∈ℝ^(n×k) (Z₁^TZ₁=I_k); E₁=A−AZ₁Z₁^T∈ℝ^(m×n).

(1b) [Ω₁, D₁]=RandSampling(Z₁, h₁, 1); h₁=16k ln(20k); Ω₁∈ℝ^(n×h₁); D₁∈ℝ^(h₁×h₁).

M₁=Z₁^TΩ₁D₁∈ℝ^(k×h₁). M₁=U_{M₁}Σ_{M₁}V_{M₁}^T with rank(M₁)=k and V_{M₁}∈ℝ^(h₁×k).

(1c) S₁=BssSampling(V_{M₁}, (E₁Ω₁D₁)^T, c₁), with c₁=4k. S₁∈ℝ^(h₁×c₁).

C₁=AΩ₁D₁S₁∈ℝ^(m×c₁), containing rescaled columns of A.

(1d) C₂=AdaptiveCols(A, C₁, 1, c₂), with c₂=1620k/ε and C₂∈ℝ^(m×c₂) with columns of A.

C=[C₁, C₂]∈ℝ^(m×c) containing c=c₁+c₂=4k+1620k/ε rescaled columns of A.

(2) Construct R with O(k+k/ε) rows. This includes the operations of (2a)-(2d).

(2a) [Y, Ψ, Δ]=BestSubspaceSVD(A, C, k); Y∈ℝ^(m×c), Ψ∈ℝ^(c×c), Δ∈ℝ^(c×k); B=YΔ.

B=Z₂D is a QR factorization of B with Z₂∈ℝ^(m×k) (Z₂^TZ₂=I_k), D∈ℝ^(k×k), and E₂=A^T−A^TZ₂Z₂^T.

(2b) [Ω₂, D₂]=RandSampling(Z₂, h₂, 1); h₂=8k ln(20k); Ω₂∈ℝ^(m×h₂); D₂∈ℝ^(h₂×h₂).

M₂=Z₂^TΩ₂D₂∈ℝ^(k×h₂). M₂=U_{M₂}Σ_{M₂}V_{M₂}^T with rank(M₂)=k and V_{M₂}∈ℝ^(h₂×k).

(2c) S₂=BssSampling(V_{M₂}, (E₂Ω₂D₂)^T, r₁), with r₁=4k and S₂∈ℝ^(h₂×r₁).

R₁=(A^TΩ₂D₂S₂)^T∈ℝ^(r₁×n), containing rescaled rows from A.

(2d) R₂=AdaptiveRows(A, Z₂, R₁, r₂), with r₂=1620k/ε and R₂∈ℝ^(r₂×n) with rows of A.

R=[R₁^T, R₂^T]^T∈ℝ^((r₁+r₂)×n) containing r=4k+1620k/ε rescaled rows of A.

(3) Construct U of rank k. This includes the operations of (3a).

(3a) Construct U∈ℝ^(c×r) with rank at most k as follows, with these equivalent formulas:

U = Ψ⁻¹ΔD⁻¹Z₂^TAR^†
  = Ψ⁻¹ΔD⁻¹(CΨ⁻¹ΔD⁻¹)^TAR^†
  = Ψ⁻¹ΔD⁻¹(CΨ⁻¹ΔD⁻¹)^†AR^†
  = Ψ⁻¹ΔD⁻¹DΔ^TΨC^†AR^†
  = Ψ⁻¹ΔΔ^TΨC^†AR^†.

Algorithm 2 takes as input an m×n matrix A, rank parameter k<rank(A), and accuracy parameter 0<ε<1. These are the same inputs as in the CUR Problem. The algorithm may output matrix C∈ℝ^(m×c) with c=O(k/ε) columns of A, matrix R∈ℝ^(r×n) with r=O(k/ε) rows of A, and matrix U∈ℝ^(c×r) with rank at most k. Algorithm 2 may closely follow the CUR proto-algorithm that is Algorithm 1. However, Algorithm 2 may make specific choices for the various operations of the proto-algorithm that can be implemented in linear time. Algorithm 2 may have three operations: (1) an optimal number of columns are selected in C; (2) an optimal number of rows are selected in R; and (3) an intersection matrix with optimal rank is constructed as U. The algorithm itself refers to several other algorithms, combined in a novel manner. Specifically, the algorithm refers to RandomizedSVD, which is described in Lemma 3.4 of Boutsidis; RandSampling, which is described in Rudelson (M. Rudelson and R. Vershynin; Sampling from large matrices: An approach through geometric functional analysis; JACM: Journal of the ACM, 54, 2007); BssSampling, which is described in Lemma 3.6 of Boutsidis; AdaptiveCols, which is described in Lemma 2.1 of Deshpande II (A. Deshpande, L. Rademacher, S. Vempala, and G. Wang; Matrix approximation and projective clustering via volume sampling; In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1117-1126, 2006); BestSubspaceSVD, which is described in Clarkson II (K. Clarkson and D. Woodruff; Numerical linear algebra in the streaming model; In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), 2009); and AdaptiveRows, which is described in Theorem 4 of Wang.

It can be shown that the total asymptotic running time of the above algorithm is as follows:

O(n²k/ε + m²k/ε + mk²/ε² + k³/ε³ + k⁴ ln k + (k/ε)·log(k/ε)).

With respect to an error bound, it can additionally be shown that the matrices C, U, and R in Algorithm 2 satisfy the following with probability at least 0.2:

∥A−CUR∥_F² ≦ (1+20ε)∥A−A_k∥_F².

Following, a CUR algorithm that runs in input-sparsity time is presented and analyzed. More specifically, this algorithm constructs C and R with rescaled columns and rows from A. To convert this to a true CUR decomposition, the decomposition system 100 may keep the un-rescaled versions of C and R and introduce the scaling factors in U. The analysis carries over to that CUR version unchanged.

The algorithm description below closely follows the CUR proto-algorithm in Algorithm 1. Algorithm 3 may be implemented on a computer system in hardware, software, or a combination thereof. Various operations performed in this algorithm follow immediately below:

Input: A∈ℝ^(m×n); rank parameter k<rank(A); and accuracy parameter 0<ε<1.

Output: C∈ℝ^(m×c) with c=O(k/ε); R∈ℝ^(r×n) with r=O(k/ε); U∈ℝ^(c×r) with rank(U)=k.

(1) Construct C with O(k+k/ε) columns. This includes the operations of (1a)-(1d).

(1a) Z₁=SparseSVD(A, k, 1); Z₁∈ℝ^(n×k) (Z₁^TZ₁=I_k).

(1b) [Ω₁, D₁]=RandSampling(Z₁, h₁, 1); h₁=16k ln(20k); Ω₁∈ℝ^(n×h₁), D₁∈ℝ^(h₁×h₁).

M₁=Z₁^TΩ₁D₁∈ℝ^(k×h₁). M₁=U_{M₁}Σ_{M₁}V_{M₁}^T with rank(M₁)=k and V_{M₁}∈ℝ^(h₁×k).

(1c) S₁=BssSamplingSparse(V_{M₁}, ((A−AZ₁Z₁^T)Ω₁D₁)^T, c₁, 0.5), with c₁=4k. S₁∈ℝ^(h₁×c₁).

C₁=AΩ₁D₁S₁∈ℝ^(m×c₁), containing rescaled columns of A.

(1d) C₂=AdaptiveColsSparse(A, C₁, 1, c₂); c₂=4820k/ε; C₂∈ℝ^(m×c₂) with columns of A.

C=[C₁ C₂]∈ℝ^(m×c) containing c=c₁+c₂=4k+4820k/ε rescaled columns of A.

(2) Construct R with O(k+k/ε) rows. This includes the operations of (2a)-(2d).

(2a) [Y, Ψ, Δ]=ApproxSubspaceSVD(A, C, k); Y∈ℝ^(m×c), Ψ∈ℝ^(c×c), Δ∈ℝ^(c×k); B=YΔ.

B=Z₂D is a QR decomposition of B with Z₂∈ℝ^(m×k) (Z₂^TZ₂=I_k), D∈ℝ^(k×k).

(2b) [Ω₂, D₂]=RandSampling(Z₂, h₂, 1); h₂=8k ln(20k); Ω₂∈ℝ^(m×h₂); D₂∈ℝ^(h₂×h₂).

M₂=Z₂^TΩ₂D₂∈ℝ^(k×h₂). M₂=U_{M₂}Σ_{M₂}V_{M₂}^T with rank(M₂)=k and V_{M₂}∈ℝ^(h₂×k).

(2c) S₂=BssSamplingSparse(V_{M₂}, ((A^T−A^TZ₂Z₂^T)Ω₂D₂)^T, r₁, 0.5), with r₁=4k. S₂∈ℝ^(h₂×r₁).

R₁=(A^TΩ₂D₂S₂)^T∈ℝ^(r₁×n), containing rescaled rows from A.

(2d) R₂=AdaptiveRowsSparse(A, Z₂, R₁, r₂); r₂=4820k/ε; R₂∈ℝ^(r₂×n) with rows of A.

R=[R₁^T, R₂^T]^T∈ℝ^((r₁+r₂)×n) containing r=4k+4820k/ε rescaled rows of A.

(3) Construct U of rank k. This includes the operations of (3a).

(3a) Let W∈ℝ^(ξ×m) be a randomly chosen sparse subspace embedding with ξ=Ω(k²ε⁻²). Then U=Ψ⁻¹ΔD⁻¹(WCΨ⁻¹ΔD⁻¹)^†WAR^†=Ψ⁻¹ΔΔ^T(WC)^†WAR^†.
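A minimal sketch of operation (3a) follows: a CountSketch-style sparse subspace embedding W is drawn and U is formed from sketched products, so that the pseudoinverses involve ξ-row matrices rather than m-row ones. The CountSketch construction and the assumption that Ψ and D are invertible are illustrative; this is not the only way to realize the step.

```python
import numpy as np

def countsketch(xi, m, rng=np.random.default_rng()):
    """A xi-by-m sparse subspace embedding: each column has a single random +/-1 entry."""
    W = np.zeros((xi, m))
    W[rng.integers(0, xi, size=m), np.arange(m)] = rng.choice([-1.0, 1.0], size=m)
    return W

def build_U_sketched(A, C, R, Psi, Delta, D, xi, rng=np.random.default_rng()):
    """Operation (3a): U = Psi^-1 Delta D^-1 (W C Psi^-1 Delta D^-1)^+ W A R^+."""
    W = countsketch(xi, A.shape[0], rng)
    left = np.linalg.inv(Psi) @ Delta @ np.linalg.inv(D)
    return left @ np.linalg.pinv(W @ C @ left) @ (W @ A) @ np.linalg.pinv(R)
```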

Algorithm 3 takes as input an m×n matrix A, rank parameter k<rank(A), and accuracy parameter 0<ε<1. These are the same inputs as in the CUR Problem. The algorithm may return matrix C∈ℝ^(m×c) with c=O(k/ε) columns of A, matrix R∈ℝ^(r×n) with r=O(k/ε) rows of A, and matrix U∈ℝ^(c×r) with rank at most k. Although Algorithm 3 follows closely the CUR proto-algorithm in Algorithm 1, this algorithm may make specific choices for the various actions of the proto-algorithm that can be implemented in input-sparsity time.

In some embodiments, Algorithm 3 may run as follows: (1) An optimal number of columns are selected in C. (2) An optimal number of rows are selected in R. And (3) an intersection matrix with optimal rank is constructed and denoted as U. As provided above, the algorithm itself refers to several other algorithms, including new algorithms as well as existing ones combined in a novel manner. Specifically, the algorithm refers to SparseSVD, which is described in Theorem 47 of Clarkson III (K. L. Clarkson and D. P. Woodruff; Low rank approximation and regression in input sparsity time; 2013); RandSampling, which is described in Rudelson; BssSampling, which is described in Lemma 3.6 of Boutsidis; AdaptiveColsSparse, which is described in Lemma 3 herein; ApproxSubspaceSVD, which is described in Kannan (R. Kannan, S. Vempala, and D. P. Woodruff; Nimble algorithms for cloud computing; arXiv preprint arXiv:1304.3162, v3, 2013); and AdaptiveRowsSparse, which is described in Lemma 4 herein.

FIG. 3 is a flow diagram of a method 300 for finding a CUR decomposition of a matrix A in input-sparsity time, using Algorithm 3, according to some embodiments of this disclosure. At a high level, matrix C may be constructed with O(k/ε) columns at block 310, matrix R may be constructed with O(k/ε) rows at block 320, and matrix U may be constructed from matrices C and R at block 330. The decomposition method 300 of FIG. 3 is similar to the decomposition method 200 of FIG. 2, with some exceptions. As a first exception, at blocks 316 and 326, an input-sparsity-time version of the BSS sampling technique may be used, as will be described further below. As a second exception, at blocks 318 and 328, the adaptive sampling performed may be an input-sparsity-time version, as will be described further below. Additionally, the construction of U at block 330 may run in input-sparsity time.

It can be shown that the total running time of the above algorithm is as follows:

O(nnz(A)log n+(m+n)·poly(log n, k, 1/ε)).

With respect to a quality-of-approximation result (i.e., error bounds) of Algorithm 3, it can also be shown that the matrices C, U, and R in Algorithm 3 satisfy, with probability at least 0.16−2/n,

∥A−CUR∥_F² ≦ (1+ε)(1+60ε)∥A−A_k∥_F².

The above Algorithm 3 for an input-sparsity-time CUR decomposition may include other algorithms, some of which exist in the art and some of which are newly provided in this disclosure. Below, those that are new are described in detail.

For use in this algorithm, an input-sparsity-time version of the so-called "BSS sampling" algorithm of Batson may be developed, as an extension of the original BSS algorithm that appeared in Boutsidis.

The following is referred to herein as Lemma 2, an input-sparsity-time, dual-set, spectral-Frobenius sparsification: Let 𝒱={ν₁, . . . , ν_n} be a decomposition of the identity, where ν_i∈ℝ^k (k<n) and Σ_{i=1}^n ν_iν_i^T=I_k. Let 𝒜={α₁, . . . , α_n} be an arbitrary set of vectors, where α_i∈ℝ^ℓ. Let W∈ℝ^(ξ×ℓ) be a randomly chosen sparse subspace embedding with ξ=O(n²/ε²)<ℓ, for some 0<ε<1. Consider a new set of vectors ℬ={Wα₁, . . . , Wα_n}, with Wα_i∈ℝ^ξ. Run the algorithm of Lemma 3.6 of Boutsidis with 𝒱={ν₁, . . . , ν_n}, ℬ={Wα₁, . . . , Wα_n}, and some integer r such that k<r≦n. Let the output of this be a set of weights s_i≧0 (i=1, . . . , n), at most r of which are non-zero. Then, with probability at least 0.98,

λ_k(Σ_{i=1}^n s_iν_iν_i^T) ≧ (1−√(k/r))²,  Tr(Σ_{i=1}^n s_iα_iα_i^T) ≦ ((1+ε)/(1−ε))·Tr(Σ_{i=1}^n α_iα_i^T) = ((1+ε)/(1−ε))·Σ_{i=1}^n ∥α_i∥₂².

Equivalently, if V∈ℝ^(n×k) is a matrix whose rows are the vectors ν_i^T, A∈ℝ^(n×ℓ) is a matrix whose rows are the vectors α_i^T, B=AW^T∈ℝ^(n×ξ) is a matrix whose rows are the vectors α_i^TW^T, and S∈ℝ^(n×r) is the sampling matrix containing the weights s_i≧0, then, with probability at least 0.98,

σ_k(V^TS) ≧ 1−√(k/r),  ∥A^TS∥_F² ≦ ((1+ε)/(1−ε))·∥A∥_F².

The weights s_i can be computed in O(nnz(A)+rnk²+nξ) time. The procedure is denoted herein as S=BssSamplingSparse(V, A, r, ε).
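A minimal sketch of the BssSamplingSparse wrapper of Lemma 2 follows: the vectors α_i (the rows of A) are first compressed with a sparse subspace embedding W, and the existing dual-set routine is then run on the compressed vectors. Here `bss_sampling` stands in for the procedure of Lemma 3.6 of Boutsidis and is assumed to be available; the embedding construction is likewise illustrative.

```python
import numpy as np

def bss_sampling_sparse(V, A, r, eps, bss_sampling, rng=np.random.default_rng()):
    """Lemma 2 wrapper: run the dual-set sparsification routine on (V, A W^T),
    where W is a sparse subspace embedding with xi = O(n^2 / eps^2) rows."""
    n, ell = A.shape
    xi = int(np.ceil(n ** 2 / eps ** 2))          # Lemma 2 assumes xi < ell
    W = np.zeros((xi, ell))
    W[rng.integers(0, xi, size=ell), np.arange(ell)] = rng.choice([-1.0, 1.0], size=ell)
    B = A @ W.T                                   # rows of B are W @ alpha_i
    return bss_sampling(V, B, r)                  # sampling matrix S with at most r nonzero weights
```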

Lemma 2 may be proven as follows: The algorithm constructs S as S=BssSampling(V, B, r). The lower bound for the smallest singular value of V^TS is immediate from Lemma 3.6 of Boutsidis. That lemma also ensures that

∥B^TS∥_F² ≦ ∥B^T∥_F²,

i.e.,

∥WA^TS∥_F² ≦ ∥WA^T∥_F².

Since W is a subspace embedding, from Clarkson III, it is the case that, with probability at least 0.99, and for all vectors y∈ℝ^n simultaneously,

(1−ε)∥A^Ty∥₂² ≦ ∥WA^Ty∥₂².

Apply this r times for y∈ℝ^n being columns from S∈ℝ^(n×r), and take a sum of the resulting inequalities:

(1−ε)∥A^TS∥_F² ≦ ∥WA^TS∥_F².

From Lemma 40 of Clarkson III, it is the case that, with probability at least 0.99,

∥WA^T∥_F² ≦ (1+ε)∥A^T∥_F².

Combining these inequalities together, it can be concluded that, with probability at least 0.98,

∥A^TS∥_F² ≦ ((1+ε)/(1−ε))·∥A^T∥_F².

Following are developed input-sparsity-time versions of certain adaptive sampling algorithms. The lemma below corresponds to a fast version of the adaptive sampling algorithm of Theorem 2.1 of Deshpande II.

The following is referred to herein as Lemma 3: Given A∈ℝ^(m×n) and V∈ℝ^(m×c₁) (with c₁≦n, m), there exists a randomized algorithm to construct C₂∈ℝ^(m×c₂) containing c₂ columns of A, such that the matrix C=[V C₂]∈ℝ^(m×(c₁+c₂)) containing the columns of V and C₂ satisfies, for any integer k>0, and with probability at least 0.9−1/n,

∥A−Π_{C,k}^F(A)∥_F² ≦ ∥A−A_k∥_F² + (30k/c₂)·∥A−VV^†A∥_F².

The above procedure is denoted as C₂=AdaptiveColsSparse(A, V, c₂). Given A and V, the algorithm takes O(nnz(A)log n+mc₁ log n+mc₁²) time to find C₂.
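A minimal sketch of AdaptiveColsSparse follows: an O(log n)-row random projection S is applied to the residual A−VV^†A before the column norms are computed, so the sampling probabilities can be formed without materializing the full residual. The dense ±1 projection used here is an illustrative stand-in for the routine of Achlioptas, and the rows counterpart of Lemma 4 is symmetric.

```python
import numpy as np

def adaptive_cols_sparse(A, V, c2, rng=np.random.default_rng()):
    """Lemma 3 sketch: sample c2 columns of A with probabilities proportional to the
    sketched squared column norms of the residual A - V V^+ A."""
    m, n = A.shape
    t = max(1, int(np.ceil(np.log2(max(n, 2)))))             # O(log n) projection rows
    S = rng.choice([-1.0, 1.0], size=(t, m)) / np.sqrt(t)    # illustrative +/-1 projection
    resid_sketch = S @ A - (S @ V) @ (np.linalg.pinv(V) @ A)
    probs = np.sum(resid_sketch ** 2, axis=0)
    probs = probs / probs.sum()
    idx = rng.choice(n, size=c2, replace=True, p=probs)
    return A[:, idx]                                         # C2: c2 sampled columns of A
```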

Lemma 3 may be proven as follows: Define the residual B=A−VV^†A∈ℝ^(m×n). From Theorem 1 (for fixed ε=0.5) of Achlioptas (D. Achlioptas; Database-friendly random projections: Johnson-Lindenstrauss with binary coins; J. Comput. Syst. Sci., 66(4):671-687, 2003), let

B̃=JLT(B, 1).

The lemma gives

∥b̃_i∥₂² ≧ (1/2)·∥b_i∥₂².

So from Theorem 2.1 of Deshpande II, after using the following distribution for the sampling,

p_i = ∥b̃_i∥₂²/∥B̃∥_F² ≧ (1/2)·(2/3)·∥b_i∥₂²/∥B∥_F² = (1/3)·∥b_i∥₂²/∥B∥_F²,

one obtains

E[∥A−Π_{C,k}^F(A)∥_F²] ≦ ∥A−A_k∥_F² + (3k/c₂)·∥A−VV^†A∥_F².

The expectation is taken with respect to the randomness in constructing C₂, so

E[∥A−Π_{C,k}^F(A)∥_F²] − ∥A−A_k∥_F² ≦ (3k/c₂)·∥A−VV^†A∥_F².

The following relation is immediate:

∥A−Π_{C,k}^F(A)∥_F² − ∥A−A_k∥_F² ≧ 0,

so, from Markov's inequality, with probability at least 0.9,

∥A−Π_{C,k}^F(A)∥_F² − ∥A−A_k∥_F² ≦ (30k/c₂)·∥A−VV^†A∥_F²,

which implies

∥A−Π_{C,k}^F(A)∥_F² ≦ ∥A−A_k∥_F² + (30k/c₂)·∥A−VV^†A∥_F².

The running time follows by a careful implementation of the random projection step inside the routine in Theorem 1 of Achlioptas. The matrices SA and SV are computed in O(nnz(A)log n) and O(mc₁ log n) arithmetic operations, respectively. Computing V^† requires O(mc₁²) operations, and (SV)V^† takes another O(mc₁ log n) operations. Computing (SVV^†)A takes O(nnz(A)log n) operations, and computing SA−SVV^†A takes another O(n log n) operations. Thus, these steps can be implemented in total time O(nnz(A)log n+mc₁ log n+mc₁²).

Following is developed an input-sparsity-time version of the adaptive sampling algorithm of Theorem 4 in Wang.

The following is referred to herein as Lemma 4: Given A∈ℝ^(m×n) and V∈ℝ^(m×c) such that

rank(V)=rank(VV^†A)=ρ,

with ρ≦c≦n, let R₁∈ℝ^(r₁×n) consist of r₁ rows of A. There exists a randomized algorithm to construct R₂∈ℝ^(r₂×n) with r₂ rows such that, for R=[R₁^T, R₂^T]^T∈ℝ^((r₁+r₂)×n), with probability at least 0.9−1/n,

∥A−VV^†AR^†R∥_F² ≦ ∥A−VV^†A∥_F² + (30ρ/r₂)·∥A−AR₁^†R₁∥_F².

This procedure is denoted as R₂=AdaptiveRowsSparse(A, V, R₁, r₂). Given A, V, and R₁, the algorithm takes

O(nnz(A)log n+nr₁ log n+nr₁²)

arithmetic operations to find R₂.

Lemma 4 may be proven as follows: There is an immediate generalization of Theorem 15 of Wang to the case when the sampling probabilities satisfy

p_i ≧ α·∥b_i∥₂²/∥B∥_F²,

for some α<1, rather than just for α=1. This leads to the following version of Theorem 4 of Wang described herein: if the probabilities in that lemma satisfy

p_i ≧ α·∥b_i∥₂²/∥B∥_F²,

for some α<1, then the bound is

E[∥A−VV^†AR^†R∥_F²] ≦ ∥A−VV^†A∥_F² + (ρ/(α·r₂))·∥A−AR₁^†R₁∥_F².

Given this bound, the proof continues by repeating the ideas used in Lemma 3. Although the proof is similar to that in Lemma 3, a proof is included herein for completeness.

Define the residual

B=A−AR₁^†R₁∈ℝ^(m×n).

From Theorem 1 in Achlioptas, let

B̃=JLT(B, 1).

Given that, it is the case that

∥b̃_i∥₂² ≧ (1/2)·∥b_i∥₂².

Using for the sampling the distribution

p_i = ∥b̃_i∥₂²/∥B̃∥_F² ≧ (1/2)·(2/3)·∥b_i∥₂²/∥B∥_F² = (1/3)·∥b_i∥₂²/∥B∥_F²,

one may obtain

E[∥A−VV^†AR^†R∥_F²] ≦ ∥A−VV^†A∥_F² + (3ρ/r₂)·∥A−AR₁^†R₁∥_F².

The expectation is taken with respect to the randomness in constructing R₂, so

E[∥A−VV^†AR^†R∥_F² − ∥A−VV^†A∥_F²] ≦ (3ρ/r₂)·∥A−AR₁^†R₁∥_F².

It follows that

∥A−VV^†AR^†R∥_F² − ∥A−VV^†A∥_F² ≧ 0,

since

∥A−VV^†AR^†R∥_F² = ∥A−VV^†A+VV^†A−VV^†AR^†R∥_F² = ∥A−VV^†A∥_F² + ∥VV^†A−VV^†AR^†R∥_F².

Hence, by Markov's inequality, with probability at least 0.9,

∥A−VV^†AR^†R∥_F² − ∥A−VV^†A∥_F² ≦ (30ρ/r₂)·∥A−AR₁^†R₁∥_F²,

which implies

∥A−VV^†AR^†R∥_F² ≦ ∥A−VV^†A∥_F² + (30ρ/r₂)·∥A−AR₁^†R₁∥_F².

The running time follows by a careful implementation of the random projection step inside the routine in Theorem 1 of Achlioptas. Computing SA may occur in O(nnz(A)log n) arithmetic operations. Computing R₁^† requires O(nr₁²) operations. Computing ((SA)R₁^†)R₁ requires another O(nr₁ log n) operations. Computing SA−SAR₁^†R₁ requires O(n log n) operations. Thus, all these operations can be implemented in O(nnz(A)log n+nr₁ log n+nr₁²) time.

Below, a deterministic, polynomial-time CUR algorithm is presented in Algorithm 4. More precisely, the algorithm constructs C and R with rescaled columns and rows from A. To convert this to a true CUR decomposition, one may keep the un-rescaled versions of C and R and introduce the scaling factors in U.

The algorithm description below closely follows the CUR proto-algorithm in Algorithm 1. Algorithm 4 may be implemented on a computer system in hardware, software, or a combination thereof. Various operations performed in this algorithm follow immediately below:

Input: A∈ℝ^(m×n); rank parameter k<rank(A); and accuracy parameter 0<ε<1.

Output: C∈ℝ^(m×c) with c=O(k/ε); R∈ℝ^(r×n) with r=O(k/ε); U∈ℝ^(c×r) with rank(U)=k.

(1) Construct C with O(k+k/ε) columns. This includes the operations of (1a)-(1c).

(1a) Z₁=DeterministicSVD(A, k, 1); Z₁∈ℝ^(n×k) (Z₁^TZ₁=I_k); E₁=A−AZ₁Z₁^T∈ℝ^(m×n).

(1b) S₁=BssSampling(V_{M₁}, E₁^T, c₁), with c₁=4k. S₁∈ℝ^(h₁×c₁).

C₁=AS₁∈ℝ^(m×c₁), containing rescaled columns of A.

(1c) C₂=AdaptiveColsD(A, C₁, 1, c₂), with c₂=10k/ε and C₂∈ℝ^(m×c₂) with columns of A.

C=[C₁ C₂]∈ℝ^(m×c) containing c=c₁+c₂=4k+10k/ε rescaled columns of A.

(2) Construct R with O(k+k/ε) rows. This includes the operations of (2a)-(2c).

(2a) [Y, Ψ, Δ]=BestSubspaceSVD(A, C, k); Y∈ℝ^(m×c), Ψ∈ℝ^(c×c), Δ∈ℝ^(c×k); B=YΔ.

B=Z₂D is a QR factorization of B with Z₂∈ℝ^(m×k) (Z₂^TZ₂=I_k), D∈ℝ^(k×k), and E₂=A^T−A^TZ₂Z₂^T.

(2b) S₂=BssSampling(V_{M₁}, E₂^T, r₁), with r₁=4k and S₂∈ℝ^(h₁×r₁).

R₁=(A^TS₂)^T∈ℝ^(r₁×n), containing rescaled rows from A.

(2c) R₂=AdaptiveRowsD(A, Z₂, R₁, r₂), with r₂=10k/ε and R₂∈ℝ^(r₂×n) with rows of A.

R=[R₁^T, R₂^T]^T∈ℝ^((r₁+r₂)×n) containing r=4k+10k/ε rescaled rows of A.

(3) Construct U of rank k. This includes the operations of (3a).

(3a) Construct U∈ℝ^(c×r) with rank at most k as follows, where the following formulas are equivalent:

U = Ψ⁻¹ΔD⁻¹Z₂^TAR^†
  = Ψ⁻¹ΔD⁻¹(CΨ⁻¹ΔD⁻¹)^TAR^†
  = Ψ⁻¹ΔD⁻¹(CΨ⁻¹ΔD⁻¹)^†AR^†
  = Ψ⁻¹ΔD⁻¹DΔ^TΨC^†AR^†
  = Ψ⁻¹ΔΔ^TΨC^†AR^†.

Algorithm 4 may take as input an m×n matrix A, rank parameter k<rank(A), and accuracy parameter 0<ε<1. These are also the inputs of the CUR Problem. The algorithm may return matrix C∈ℝ^(m×c) with c=O(k/ε) columns of A, matrix R∈ℝ^(r×n) with r=O(k/ε) rows of A, and matrix U∈ℝ^(c×r) with rank at most k. Algorithm 4 may closely follow the CUR proto-algorithm of Algorithm 1. More particularly, Algorithm 4 may make specific choices for the various operations of the proto-algorithm that can be implemented deterministically in polynomial time.

In some embodiments, Algorithm 4 may run as follows: (1) An optimal number of columns may be selected in C. (2) An optimal number of rows may be selected in R. And (3) an intersection matrix with optimal rank may be constructed and denoted as U. The algorithm itself may integrate one or more other algorithms, including new algorithms as well as existing ones combined in a novel manner. Specifically, Algorithm 4 refers to DeterministicSVD, which is described in Theorem 3.1 in Ghashami (M. Ghashami and J. Phillips; Relative errors for deterministic low-rank matrix approximations; In SODA, 2013); BssSampling, which is described in Lemma 3.6 of Boutsidis; AdaptiveColsD, which is described in Lemma 5; BestSubspaceSVD, which is described in Clarkson II; and AdaptiveRowsD, which is described in Lemma 6.

FIG. 4 is a flow diagram of a method 400 for finding a deterministic CUR decomposition of a matrix A, using Algorithm 4, according to some embodiments of this disclosure. At a high level, matrix C may be constructed with O(k/ε) columns at block 410, matrix R may be constructed with O(k/ε) rows at block 420, and matrix U may be constructed from matrices C and R at block 430. The decomposition method 400 of FIG. 4 is similar to the decomposition method 200 of FIG. 2, except that, as shown at blocks 418 and 428, the adaptive sampling performed may be derandomized, as will be described further below.

It can be shown that the above algorithm runs in a total time of O(mn³k/ε).

With respect to error bounds, it can be further shown that the matrices C, U, and R in Algorithm 4 satisfy

∥A−CUR∥_F² ≦ ∥A−A_k∥_F² + O(ε)∥A−A_k∥_F².

The above Algorithm 4 may include other algorithms, some of which exist in the art and some of which are newly provided in this disclosure. Below, those that are new are described in detail. As a starting point, below, a lemma is proven regarding derandomizing the adaptive sampling procedure of Deshpande II. The lemma offers a deterministic adaptive sampling algorithm used in the above deterministic CUR algorithm.

The following is referred to herein as Lemma 5: Given A∈ℝ^(m×n) and V∈ℝ^(m×c₁) (with c₁≦n, m), let the residual be defined as

B=A−VV^†A∈ℝ^(m×n).

There exists a deterministic algorithm to construct C₂∈ℝ^(m×c₂) containing c₂ columns of A, such that C=[V C₂]∈ℝ^(m×(c₁+c₂)), i.e., the matrix that contains the columns of V and C₂, satisfies, for any integer k>0,

∥A−Π_{C,k}^F(A)∥_F² ≦ ∥A−A_k∥_F² + (4k/c₂)·∥A−VV^†A∥_F².

The above procedure is denoted as C₂=AdaptiveColsD(A, V, c₂). Given A and V, the algorithm requires O(mn³c) arithmetic operations to find C₂. Here c=c₁+c₂.

Lemma 5 may be proven as follows: This lemma corresponds to a derandomization of Theorem 2.1 of Deshpande II. It suffices to derandomize Theorem 2.1 of Deshpande II, since the above lemma is equivalent to the theorem by taking transposes. This disclosure now switches the notation to that of Deshpande II for the sake of clarity. In that theorem, E=A−AVV^T, and a random sample S of s rows of A is chosen from a distribution p with

p_i = ∥E^((i))∥₂²/∥E∥_F²,

for each i∈[m], and independently for each of the s trials. It will be understood that s corresponds to c₂ in the notation herein. It is shown in Deshpande II that if PP^T is a projection operator onto the space spanned by the s sampled rows together with the rows of V, then there is a rank-k approximation to A, denoted A′, in the row space of P for which

E_S[∥A−A′∥_F²] ≦ ∥A−A_k∥_F² + (k/s)·∥E∥_F².

To derandomize Theorem 2.1 of Deshpande II, the distribution p is discretized, defining a new distribution q. Let i*∈[n] be such that p_{i*}≧p_i for all i≠i*. Then, let

r_{i*} = p_{i*} + Σ_{i≠i*} p_i/2;

hence, r is a distribution. Round each r_i, i≠i*, up to the nearest integer multiple of 1/(4n) by adding γ_i, and let q_i be the resulting value. For q_{i*}, let q_{i*}=r_{i*}−Σ_{i≠i*}γ_i. Then Σ_i q_i=1 and 0≦q_i≦1 for all i. Consider q_{i*}. Then,

0 ≦ Σ_{i≠i*} γ_i ≦ n·(1/(4n)) = 1/4.

On the other hand,

r_{i*} = p_{i*} + Σ_{i≠i*} p_i/2 ≧ Σ_i p_i/2 = 1/2.

Hence q_{i*}∈[0,1] as well. Hence q is a distribution. It should be noted that, for all i≠i*, q_i≧r_i≧p_i/2. Moreover, q_{i*}≧r_{i*}−1/4≧1/4, so q_{i*}≧p_{i*}/4. Thus, all q_i are within a factor of 4 from p_i: q_i≧p_i/4.
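The discretization above can be written as a short routine: every probability except the largest is halved and rounded up to a multiple of 1/(4n), and the rounding excess is subtracted from the largest entry. The sketch below mirrors that construction and is illustrative only.

```python
import numpy as np

def discretize_distribution(p):
    """Build q from p: q_i is a multiple of 1/(4n) for i != i*, q_i >= p_i / 4 for
    every i, and sum(q) == 1, following the derandomization argument above."""
    p = np.asarray(p, dtype=float)
    n = p.size
    i_star = int(np.argmax(p))
    r = p / 2.0
    r[i_star] = p[i_star] + (p.sum() - p[i_star]) / 2.0      # r_{i*} absorbs half of the rest
    q = np.ceil(r * 4 * n) / (4 * n)                         # round up to multiples of 1/(4n)
    gamma_sum = (q.sum() - q[i_star]) - (r.sum() - r[i_star])
    q[i_star] = r[i_star] - gamma_sum                        # subtract the rounding excess
    return q
```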

Following, as compared to the proof of Theorem 2.1 of Deshpande II, distribution p is replaced with distribution q. As in the proof of Deshpande II, vectors w^((1)), . . . , w^((k)) are defined in the column span of P such that there exists a rank-k approximation to A, denoted A′, whose rows are spanned by these vectors and for which

E_S[∥A−A′∥_F²] ≦ ∥A−A_k∥_F² + (k/s)·∥E∥_F².

This suffices to prove the theorem.

Define

X_l^((j)) = u_i^((j))E^((i))/q_i,

with probability q_i, where l∈[s] and i∈[m], E^((i)) denotes the i-th row of E, and u^((j)) is the j-th left singular vector of A. As in the proof of Deshpande II, define

X^((j)) = (1/s)·Σ_{l=1}^s X_l^((j)).

In that case, E_S[X^((j))]=E^Tu^((j)). Define w^((j))=VV^TA^Tu^((j))+X^((j)), so that E_S[w^((j))]=σ_jν^((j)), where σ_j is the j-th singular value of A, and ν^((j)) the j-th right singular vector. The same calculation as in Equations 2.4 and 2.5 of the proof of Deshpande II gives

E_S∥w^((j))−σ_jν^((j))∥² = (1/s²)·Σ_{l=1}^s E[∥X_l^((j))∥²] − (1/s)·∥E^Tu^((j))∥²,

as in Equation 2.6 of Deshpande II, where pairwise independence suffices for this aspect.

Based on the above, it is given that

${E_{S}{X_{}^{(j)}}_{2}^{2}} = {\sum\limits_{i = 1}^{m}{q_{i}\frac{{{u_{i}^{(j)}E^{(i)}}}^{2}}{q_{i}^{2}}}}$

$\leq {\frac{{{u_{i^{*}}^{(j)}E^{(i^{*})}}}_{2}^{2}}{\frac{1}{4}\frac{{E^{(i^{*})}}_{2}^{2}}{{E}_{F}^{2}}} + {\sum\limits_{i \neq i^{*}}\frac{{{u_{i}^{(j)}E^{(i)}}}_{2}^{2}}{\frac{{E^{(i)}}_{2}^{2}}{{E}_{F}^{2}}}}} \leq {4{{E}_{F}^{2}.}}$

This is a slightly weaker upper bound than using distribution p, whichDeshpande II shows in its Equation 2.8 achieves E_(S)∥X_(i) ^((j))∥₂²≦∥E∥_(F) ². The constant 4 makes little difference for this disclosurepurposes, while the fact that q is discrete assists withderandomization.

Equation 2.8 of Deshpande II changes toE_(S)∥w^((j))−σ_(j)ν^((j))∥²≦4/Z∥E∥_(F) ². The rest of the proof is asin Deshpande II, showing that, if pp^(T) is a projection operator ontothe space spanned by the s sampled rows together with the columns of V,then there is a rank-k approximation of A, denoted A′ in the columnspace of P for which

${E_{S}{{A - A^{\prime}}}_{P}^{2}} \leq {{{A - A_{k}}}_{F}^{2} + {\frac{4k}{s}{{E}_{P}^{2}.}}}$

In particular there exists a choice of 5S for which, for thecorresponding A′, it is the case that

${{A - A^{\prime}}}_{P}^{2} \leq {{{A - A_{k}}}_{F}^{2} + {\frac{4k}{s}{{E}_{F}^{2}.}}}$

Distribution q is a discrete distribution, and the trials need only be pairwise independent. Let h:[s]→[4n] be drawn from a pairwise-independent hash function family ℋ. In that case, ℋ need only have size O(n²), as described in Carter (L. Carter and M. N. Wegman; Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143-154, 1979). For each trial l∈[s], h(l) is computed. Since the probabilities q_(i), where i=1, . . . , n, are integer multiples of 1/(4n) and Σ_(i)q_(i)=1, the smallest i∈[n] for which h(l)/(4n)≦Σ_(i′=1) ^(i) q_(i′) is selected. The probability of picking i in each trial is therefore q_(i).

For each h∈ℋ, the corresponding sample set S is computed, along with the best rank-k approximation A′ of A in the space spanned by the s sampled rows together with the rows of V. Then ∥A−A′∥_(F) is computed, and the S with the smallest such value is chosen. Computing this value can be done in O(mn(c₁+c₂)+n(c₁+c₂)²+mn) time, and since |ℋ|=O(n²), the overall time is O(mn³(c₁+c₂)).
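
For concreteness, the following is a minimal sketch of this hash-based trial selection, assuming NumPy and a Carter-Wegman-style family h_(a,b)(l)=((a·l+b) mod P) mod 4n over a prime P; the helper names (next_prime, hash_family, sample_indices) are illustrative assumptions, the modular composition is only approximately uniform on [4n], and the disclosure itself only requires some pairwise-independent family of size O(n²) as in Carter:

    import itertools
    import numpy as np

    def next_prime(m):
        """Return the smallest prime >= m (adequate for the small moduli used here)."""
        def is_prime(x):
            if x < 2:
                return False
            d = 2
            while d * d <= x:
                if x % d == 0:
                    return False
                d += 1
            return True
        while not is_prime(m):
            m += 1
        return m

    def hash_family(n):
        """Enumerate a Carter-Wegman-style family h_{a,b}(l) = ((a*l + b) mod P) mod 4n.
        The family has O(P^2) = O(n^2) members."""
        P = next_prime(4 * n)
        for a, b in itertools.product(range(1, P), range(P)):
            yield lambda l, a=a, b=b: ((a * l + b) % P) % (4 * n)

    def sample_indices(h, num_trials, q):
        """For each trial l, pick the smallest index i with h(l)/(4n) <= q_1 + ... + q_i."""
        n = len(q)
        cdf = np.cumsum(q)
        picks = []
        for l in range(1, num_trials + 1):
            u = (h(l) + 1) / (4.0 * n)      # a value on the 1/(4n) grid, in (0, 1]
            picks.append(min(int(np.searchsorted(cdf, u - 1e-9)), n - 1))
        return picks

Because every q_(i) sits on the 1/(4n) grid, each trial picks index i with probability exactly q_(i) under a uniform hash value, and the enumeration over all O(n²) family members is what yields the deterministic selection described above.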

Following is a proof for a lemma derandomizing the recent row/column adaptive sampling method of Wang, which is stated in Theorem 4 therein. The following lemma in general does not necessarily imply the previous lemma, because the previous lemma concerned approximating A by a rank-k matrix, whereas the following lemma concerns approximating A by a matrix of rank ρ, which is in general different from k. By setting V=A_(k) and taking the transpose version of the result, one gets only a version of the previous lemma with A−CC^(†)A on the left-hand side. Nevertheless, the derandomization arguments are similar.

The following is referred to herein as Lemma 6: Given A∈ℝ^(m×n) and V∈ℝ^(m×c), such that

rank(V)=rank(VV ^(†) A)=ρ,

with ρ≦c<n, let R₁∈ℝ^(r₁×n) contain r₁ rows of A. There exists a deterministic algorithm to construct R₂∈ℝ^(r₂×n) with r₂ rows such that, for

R=[R ₁ ^(T) ,R ₂ ^(T)]^(T)∈ℝ^((r₁+r₂)×n),

it is the case that

$\|A - VV^{\dagger}AR^{\dagger}R\|_{F}^{2} \leq \|A - VV^{\dagger}A\|_{F}^{2} + \frac{4\rho}{r_{2}}\,\|A - AR_{1}^{\dagger}R_{1}\|_{F}^{2}.$

The above procedure is denoted herein as R₂=AdaptiveRowsD(A, V, R₁, r₂).Given A, V, and R₁, it takes O(n²(mc²+nr²+mnc+mnr)) time to find R₂.Here c=c₁+c₂, r=r₁+r₂.

Lemma 6 may be proven as follows: This lemma corresponds to a derandomization of Theorem 5 of Wang. The matrix C in the proof of Wang corresponds to the matrix V herein. Despite the notation of Wang, as that reference states in its theorem statement and proves, the matrix C need not be a subset of columns of A. To prove the theorem, Wang actually restates it as Theorem 15 in that paper, switching the role of rows and columns, which will now be derandomized herein. Lemma 6 herein follows by taking transposes.

In that theorem of Wang, the authors define a distribution p on the columns b₁, . . . , b_(n) of B, where B=A−C₁C₁ ^(†)A for a matrix C₁∈ℝ^(m×c₁) having c₁ columns of A. In their proof, they show that, for p_(i)=∥b_(i)∥₂ ²/∥B∥_(F) ², if one samples c₂ columns of A in independent and identically distributed trials, where in each trial the i-th column is chosen with probability p_(i), then if C₂∈ℝ^(m×c₂) has the c₂ sampled columns and C=[C₁,C₂]∈ℝ^(m×(c₁+c₂)), then

$E_{C_{2}}\,\|A - CC^{\dagger}AR^{\dagger}R\|_{F}^{2} \leq \|A - AR^{\dagger}R\|_{F}^{2} + \frac{\rho}{c_{2}}\,\|A - C_{1}C_{1}^{\dagger}A\|_{F}^{2}.$

To derandomize Theorem 5 of Wang, this disclosure discretizes the distribution p, defining a new distribution q. Let i*∈[n] be such that p_(i)*≧p_(i), for all i≠i*. Then, let

$r_{i} = p_{i}/2\ \text{for all}\ i \neq i^{*}, \qquad r_{i^{*}} = p_{i^{*}} + \sum\limits_{i \neq i^{*}} p_{i}/2;$

hence, r is a distribution. Round each r_(i), i≠i*, up to the nearest integer multiple of 1/(4n) by adding γ_(i), and let q_(i) be the resulting value. For q_(i)*, let q_(i)*=r_(i)*−Σ_(i≠i*)γ_(i). It follows that Σ_(i)q_(i)=1 and 0≦q_(i)≦1 for all i. When q_(i)* is considered, then 0≦Σ_(i≠i*)γ_(i)≦n*(1/(4n))=1/4. On the other hand, r_(i)*=p_(i)*+Σ_(i≠i*)p_(i)/2≧Σ_(i)p_(i)/2=1/2. Hence, q_(i)*∈[0,1] as well. Thus, q is a distribution. For all i≠i*, q_(i)≧r_(i)≧p_(i)/2. Moreover, q_(i)*≧r_(i)*−1/4≧1/4, so q_(i)*≧p_(i)*/4. Therefore, all q_(i) are within a factor of 4 of p_(i): q_(i)≧p_(i)/4.

Following the argument in the proof of Theorem 15 of Wang, distribution p is replaced with q below, and the differences are pointed out. Define the random variable

${x_{j,{()}} = {\frac{v_{i,j}}{q_{i}}b_{i}}},$

where x_(j,(l)) takes this value with probability q_(i) for each trial l∈[c₂], ν_(i,j) is the (i,j)-th entry of V_(AR†R,ρ)∈ℝ^(n×ρ), and V_(AR†R,ρ)∈ℝ^(n×ρ) denotes the matrix of the top ρ right singular vectors of AR^(†)R. It is the case that

$E\left\lbrack x_{j,(l)} \right\rbrack = \sum\limits_{i = 1}^{n} q_{i}\,\frac{v_{i,j}}{q_{i}}\,b_{i} = Bv_{j}.$

Moreover,

${E{x_{j,{()}}}_{2}^{2}} = {{\sum\limits_{i = 1}^{n}{q_{i}\frac{v_{i,j}^{2}}{q_{i}^{2}}{b_{i}}_{2}^{2}}} \leq {\frac{v_{i^{*},j}^{2}{b_{i^{*}}}_{2}^{2}}{\frac{{b_{i^{*}}}^{2}}{4{B}_{F}^{2}}} + {\sum\limits_{i \neq i^{*}}\frac{v_{i,j}^{2}{b_{i}}_{2}^{2}}{\frac{{b_{i^{*}}}^{2}}{{B}_{F}^{2}}}}} \leq {4{{B}_{F}^{2}.}}}$

This is slightly weaker than the corresponding upper bound of E∥x_(j,(l))∥₂ ²≦∥B∥_(F) ² shown for distribution p in Wang, but the constant 4 makes little difference for the purposes of this disclosure, while the fact that q is discrete assists with derandomization.

Let

$x_{j} = \frac{1}{c_{2}}\sum\limits_{l = 1}^{c_{2}} x_{j,(l)}.$

By linearity of expectation,

E[x _(j) ]=E[x _(j,(l)) ]=Bν_(j).

To bound E∥x_(j)−Bν_(j)∥₂ ², pairwise independence of the trials suffices:

${E{{x_{j} - {Bv}_{j}}}_{2}^{2}} = {{E{{x_{j} - {E\left\lbrack x_{j} \right\rbrack}}}_{2}^{=}\frac{1}{c_{2}}E{{x_{j,{()}} - {E\left\lbrack x_{j,{()}} \right\rbrack}}}_{2}^{2}} = {\frac{1}{c_{2}}E{{x_{j,{()}} - {Bv}_{j}}}_{2}^{2}}}$

(referred to herein as Equation 4), where the second equality uses pairwise independence, as it corresponds to the sum of the variances being the variance of the sum.
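
For clarity, the calculation behind Equation 4 can be written out; this is only the standard variance expansion under the pairwise-independence assumption already stated:

$E\,\|x_{j} - E\lbrack x_{j}\rbrack\|_{2}^{2} = \frac{1}{c_{2}^{2}}\sum\limits_{l,l^{\prime} = 1}^{c_{2}} E\left\langle x_{j,(l)} - E\lbrack x_{j,(l)}\rbrack,\ x_{j,(l^{\prime})} - E\lbrack x_{j,(l^{\prime})}\rbrack \right\rangle = \frac{1}{c_{2}^{2}}\sum\limits_{l = 1}^{c_{2}} E\,\|x_{j,(l)} - E\lbrack x_{j,(l)}\rbrack\|_{2}^{2},$

since pairwise independence makes every cross term with l≠l′ vanish, leaving c₂ identical variance terms and hence the factor 1/c₂ in Equation 4.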

The remainder of the proof of Theorem 15 in Wang uses only that slightly stronger bound, as compared to Equation 4 herein. With the weaker bound herein, the argument of Wang yields

$E_{C_{2}}\,\|A - CC^{\dagger}AR^{\dagger}R\|_{F}^{2} \leq \|A - AR^{\dagger}R\|_{F}^{2} + \frac{4\rho}{c_{2}}\,\|B\|_{F}^{2},$

whereas the bound of Wang would not have the factor of 4. In particular, there exists a choice of C₂ for which, for the corresponding C, it is the case that

$\|A - CC^{\dagger}AR^{\dagger}R\|_{F}^{2} \leq \|A - AR^{\dagger}R\|_{F}^{2} + \frac{4\rho}{c_{2}}\,\|B\|_{F}^{2}.$

Distribution q is a discrete distribution, and the trials need only be pairwise independent. Let h:[c₂]→[4n] be drawn from a pairwise-independent hash function family ℋ. Then ℋ need only have size O(n²), as described in Carter. For each trial l∈[c₂], h(l) is computed. Since the probabilities q_(i), for i=1, . . . , n, are integer multiples of 1/(4n) and Σ_(i)q_(i)=1, the smallest i∈[n] for which h(l)/(4n)≦Σ_(i′=1) ^(i) q_(i′) is selected. The probability of picking i in each trial is therefore q_(i).

For each h∈ℋ, the corresponding column sample C₂ is computed, along with ∥A−CC^(†)AR^(†)R∥_(F) ². The C₂ resulting in the smallest such value is chosen. Computing this value can be done in O(mc²+nr²+mnr+mnc+mn) time, and since |ℋ|=O(n²), the overall time is O(n²(mc²+nr²+mnr+mnc)). Here, c=c₁+c₂ and r=r₁+r₂.
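
To make the enumeration concrete, the following is a minimal sketch of this derandomized column-selection loop, reusing the hypothetical discretize_distribution, hash_family, and sample_indices helpers from the earlier sketches; the name adaptive_cols_d is an illustrative assumption, and the row procedure AdaptiveRowsD of Lemma 6 corresponds to applying the same routine to A^(T), per the transposition argument above:

    import numpy as np

    def adaptive_cols_d(A, C1, R, c2):
        """Sketch of the derandomized adaptive column sampling described above:
        enumerate the pairwise-independent hash family and keep the c2 extra
        columns C2 of A for which C = [C1, C2] minimizes ||A - C C^+ A R^+ R||_F^2."""
        m, n = A.shape
        B = A - C1 @ np.linalg.pinv(C1) @ A                 # residual after C1
        p = np.sum(B * B, axis=0) / np.sum(B * B)           # p_i = ||b_i||_2^2 / ||B||_F^2
        q = discretize_distribution(p)                      # multiples of 1/(4n), q_i >= p_i/4

        AR = A @ np.linalg.pinv(R) @ R                      # A projected onto the rows of R
        best_err, best_cols = np.inf, None
        for h in hash_family(n):                            # O(n^2) candidate hash functions
            cols = sample_indices(h, c2, q)                 # c2 column indices of A
            C = np.hstack([C1, A[:, cols]])
            err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ AR, 'fro') ** 2
            if err < best_err:
                best_err, best_cols = err, cols
        return A[:, best_cols]

Each candidate requires one pseudoinverse of C and one Frobenius norm, which is what gives the O(n²(mc²+nr²+mnr+mnc)) bound stated above; the same loop, with ∥A−A′∥_(F) as the objective, corresponds to the row-sampling derandomization of the previous lemma.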

FIG. 5 illustrates a block diagram of a computer system 500 for use in implementing a decomposition system 100 or method according to some embodiments. The decomposition systems 100 and methods described herein may be implemented in hardware, software (e.g., firmware), or a combination thereof. In an exemplary embodiment, the methods described may be implemented, at least in part, in hardware and may be part of the microprocessor of a special or general-purpose computer system 500, such as a personal computer, workstation, minicomputer, or mainframe computer.

In an exemplary embodiment, as shown in FIG. 5, the computer system 500 includes a processor 505, memory 510 coupled to a memory controller 515, and one or more input devices 545 and/or output devices 540, such as peripherals, that are communicatively coupled via a local I/O controller 535. These devices 540 and 545 may include, for example, a printer, a scanner, a microphone, and the like. A conventional keyboard 550 and mouse 555 may be coupled to the I/O controller 535. The I/O controller 535 may be, for example, one or more buses or other wired or wireless connections, as are known in the art. The I/O controller 535 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications.

The I/O devices 540, 545 may further include devices that communicate both inputs and outputs, for instance disk and tape storage, a network interface card (NIC) or modulator/demodulator (for accessing other files, devices, systems, or a network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, and the like.

The processor 505 is a hardware device for executing hardware instructions or software, particularly those stored in memory 510. The processor 505 may be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer system 500, a semiconductor-based microprocessor (in the form of a microchip or chip set), a macroprocessor, or other device for executing instructions. The processor 505 includes a cache 570, which may include, but is not limited to, an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for both executable instructions and data. The cache 570 may be organized as a hierarchy of more cache levels (L1, L2, etc.).

The memory 510 may include any one or combinations of volatile memory elements (e.g., random access memory, RAM, such as DRAM, SRAM, SDRAM, etc.) and nonvolatile memory elements (e.g., ROM, erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), tape, compact disc read only memory (CD-ROM), disk, diskette, cartridge, cassette, or the like). Moreover, the memory 510 may incorporate electronic, magnetic, optical, or other types of storage media. Note that the memory 510 may have a distributed architecture, where various components are situated remote from one another but may be accessed by the processor 505.

The instructions in memory 510 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 5, the instructions in the memory 510 include a suitable operating system (OS) 511. The operating system 511 essentially may control the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.

Additional data, including, for example, instructions for the processor 505 or other retrievable information, may be stored in storage 520, which may be a storage device such as a hard disk drive or solid state drive. The stored instructions in memory 510 or in storage 520 may include those enabling the processor to execute one or more aspects of the decomposition systems 100 and methods of this disclosure.

The computer system 500 may further include a display controller 525 coupled to a display 530. In an exemplary embodiment, the computer system 500 may further include a network interface 560 for coupling to a network 565. The network 565 may be an IP-based network for communication between the computer system 500 and any external server, client, and the like via a broadband connection. The network 565 transmits and receives data between the computer system 500 and external systems. In an exemplary embodiment, the network 565 may be a managed IP network administered by a service provider. The network 565 may be implemented in a wireless fashion, e.g., using wireless protocols and technologies, such as WiFi, WiMax, etc. The network 565 may also be a packet-switched network such as a local area network, wide area network, metropolitan area network, the Internet, or other similar type of network environment. The network 565 may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), an intranet, or other suitable network system and may include equipment for receiving and transmitting signals.

Decomposition systems 100 and methods according to this disclosure may be embodied, in whole or in part, in computer program products or in computer systems 500, such as that illustrated in FIG. 5.

The terminology used herein is for the purpose of describing particularembodiments only and is not intended to be limiting of the invention. Asused herein, the singular forms “a”, “an” and “the” are intended toinclude the plural forms as well, unless the context clearly indicatesotherwise. It will be further understood that the terms “comprises”and/or “comprising,” when used in this specification, specify thepresence of stated features, integers, steps, operations, elements,and/or components, but do not preclude the presence or addition of oneor more other features, integers, steps, operations, elements,components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of allmeans or step plus function elements in the claims below are intended toinclude any structure, material, or act for performing the function incombination with other claimed elements as specifically claimed. Thedescription of the present invention has been presented for purposes ofillustration and description, but is not intended to be exhaustive orlimited to the invention in the form disclosed. Many modifications andvariations will be apparent to those of ordinary skill in the artwithout departing from the scope and spirit of the invention. Theembodiments were chosen and described in order to best explain theprinciples of the invention and the practical application, and to enableothers of ordinary skill in the art to understand the invention forvarious embodiments with various modifications as are suited to theparticular use contemplated.

The present invention may be a system, a method, and/or a computerprogram product. The computer program product may include a computerreadable storage medium (or media) having computer readable programinstructions thereon for causing a processor to carry out aspects of thepresent invention.

The computer readable storage medium can be a tangible device that canretain and store instructions for use by an instruction executiondevice. The computer readable storage medium may be, for example, but isnot limited to, an electronic storage device, a magnetic storage device,an optical storage device, an electromagnetic storage device, asemiconductor storage device, or any suitable combination of theforegoing. A non-exhaustive list of more specific examples of thecomputer readable storage medium includes the following: a portablecomputer diskette, a hard disk, a random access memory (RAM), aread-only memory (ROM), an erasable programmable read-only memory (EPROMor Flash memory), a static random access memory (SRAM), a portablecompact disc read-only memory (CD-ROM), a digital versatile disk (DVD),a memory stick, a floppy disk, a mechanically encoded device such aspunch-cards or raised structures in a groove having instructionsrecorded thereon, and any suitable combination of the foregoing. Acomputer readable storage medium, as used herein, is not to be construedas being transitory signals per se, such as radio waves or other freelypropagating electromagnetic waves, electromagnetic waves propagatingthrough a waveguide or other transmission media (e.g., light pulsespassing through a fiber-optic cable), or electrical signals transmittedthrough a wire.

Computer readable program instructions described herein can bedownloaded to respective computing/processing devices from a computerreadable storage medium or to an external computer or external storagedevice via a network, for example, the Internet, a local area network, awide area network and/or a wireless network. The network may comprisecopper transmission cables, optical transmission fibers, wirelesstransmission, routers, firewalls, switches, gateway computers and/oredge servers. A network adapter card or network interface in eachcomputing/processing device receives computer readable programinstructions from the network and forwards the computer readable programinstructions for storage in a computer readable storage medium withinthe respective computing/processing device.

Computer readable program instructions for carrying out operations ofthe present invention may be assembler instructions,instruction-set-architecture (ISA) instructions, machine instructions,machine dependent instructions, microcode, firmware instructions,state-setting data, or either source code or object code written in anycombination of one or more programming languages, including an objectoriented programming language such as Java, Smalltalk, C++ or the like,and conventional procedural programming languages, such as the “C”programming language or similar programming languages. The computerreadable program instructions may execute entirely on the user'scomputer, partly on the user's computer, as a stand-alone softwarepackage, partly on the user's computer and partly on a remote computeror entirely on the remote computer or server. In the latter scenario,the remote computer may be connected to the user's computer through anytype of network, including a local area network (LAN) or a wide areanetwork (WAN), or the connection may be made to an external computer(for example, through the Internet using an Internet Service Provider).In some embodiments, electronic circuitry including, for example,programmable logic circuitry, field-programmable gate arrays (FPGA), orprogrammable logic arrays (PLA) may execute the computer readableprogram instructions by utilizing state information of the computerreadable program instructions to personalize the electronic circuitry,in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference toflowchart illustrations and/or block diagrams of methods, apparatus(systems), and computer program products according to embodiments of theinvention. It will be understood that each block of the flowchartillustrations and/or block diagrams, and combinations of blocks in theflowchart illustrations and/or block diagrams, can be implemented bycomputer readable program instructions.

These computer readable program instructions may be provided to aprocessor of a general purpose computer, special purpose computer, orother programmable data processing apparatus to produce a machine, suchthat the instructions, which execute via the processor of the computeror other programmable data processing apparatus, create means forimplementing the functions/acts specified in the flowchart and/or blockdiagram block or blocks. These computer readable program instructionsmay also be stored in a computer readable storage medium that can directa computer, a programmable data processing apparatus, and/or otherdevices to function in a particular manner, such that the computerreadable storage medium having instructions stored therein comprises anarticle of manufacture including instructions which implement aspects ofthe function/act specified in the flowchart and/or block diagram blockor blocks.

The computer readable program instructions may also be loaded onto acomputer, other programmable data processing apparatus, or other deviceto cause a series of operational steps to be performed on the computer,other programmable apparatus or other device to produce a computerimplemented process, such that the instructions which execute on thecomputer, other programmable apparatus, or other device implement thefunctions/acts specified in the flowchart and/or block diagram block orblocks.

The flowchart and block diagrams in the Figures illustrate thearchitecture, functionality, and operation of possible implementationsof systems, methods, and computer program products according to variousembodiments of the present invention. In this regard, each block in theflowchart or block diagrams may represent a module, segment, or portionof instructions, which comprises one or more executable instructions forimplementing the specified logical function(s). In some alternativeimplementations, the functions noted in the block may occur out of theorder noted in the figures. For example, two blocks shown in successionmay, in fact, be executed substantially concurrently, or the blocks maysometimes be executed in the reverse order, depending upon thefunctionality involved. It will also be noted that each block of theblock diagrams and/or flowchart illustration, and combinations of blocksin the block diagrams and/or flowchart illustration, can be implementedby special purpose hardware-based systems that perform the specifiedfunctions or acts or carry out combinations of special purpose hardwareand computer instructions.

The descriptions of the various embodiments of the present inventionhave been presented for purposes of illustration, but are not intendedto be exhaustive or limited to the embodiments disclosed. Manymodifications and variations will be apparent to those of ordinary skillin the art without departing from the scope and spirit of the describedembodiments. The terminology used herein was chosen to best explain theprinciples of the embodiments, the practical application or technicalimprovement over technologies found in the marketplace, or to enableothers of ordinary skill in the art to understand the embodimentsdisclosed herein.

What is claimed is:
 1. A computer-implemented method, comprising: constructing, by a computer processor, a matrix C based on a matrix A; constructing a matrix R based on the matrix A and the matrix C; and constructing a matrix U based on the matrices A, C, and R; wherein the matrices C, U, and R comprise a CUR decomposition of the matrix A; and wherein the construction of the matrices C, U, and R comprises at least one of an input-sparsity-time CUR and a deterministic CUR.
 2. The method of claim 1, wherein: constructing the matrix C based on a matrix A comprises: computing Z₁ as the top k singular vectors of A, wherein a scalar k is less than the rank of the matrix A, wherein a scalar ε is between 0 and 1, and wherein ∥A−AZ₁Z₁ ^(T)∥_(F) ²≦(1+ε)∥A−A_(k)∥_(F) ²; sampling O(k log k) columns from the transpose of Z₁, wherein the sampling uses leverage scores; down-sampling the sampled columns to C₁ comprising c₁=O(k) columns using BSS sampling, wherein ∥A−C₁C₁ ^(†)A∥_(F) ²≦O(1)∥A−AZ₁Z₁ ^(T)∥_(F) ²; and adaptively sampling c₂=O(k/ε) columns of A, wherein $\|A - C \cdot D\|_{F}^{2} \leq \|A - A_{k}\|_{F}^{2} + \left( \frac{k}{c_{2}} \right)\|A - C_{1}C_{1}^{\dagger}A\|_{F}^{2};$ and constructing the matrix R based on the matrix A comprises: finding Z₂ in the span of the matrix C, wherein ∥A−AZ₂Z₂ ^(T)∥_(F) ²≦(1+ε)∥A−A_(k)∥_(F) ²; sampling O(k log k) rows of Z₂, wherein the sampling uses leverage scores; down-sampling the sampled rows to R₁ comprising r₁=O(k) rows using BSS sampling, wherein ∥A−AR₁ ^(†)R₁∥_(F) ²≦O(1)∥A−AZ₂Z₂ ^(T)∥_(F) ²; and adaptively sampling r₂=O(k/ε) rows of A, wherein $\|A - Z_{2}Z_{2}^{T}AR^{\dagger}R\|_{F}^{2} \leq \|A - Z_{2}Z_{2}^{T}A\|_{F}^{2} + \frac{rank\left( Z_{1}Z_{2}^{T}A \right)}{r_{2}}\|A - AR_{1}^{\dagger}R_{1}\|_{F}^{2}.$
 3. The method of claim 2, further comprising projecting one or more columns of the matrix A*W onto the matrix C, wherein the matrix W is a random subspace embedding.
 4. The method of claim 2, wherein: down-sampling the sampled columns to C₁ comprising c₁=O(k) columns using BSS sampling uses an input-sparsity-time BSS sampling algorithm; adaptively sampling c₂=O(k/ε) columns of A occurs in input-sparsity time; down-sampling the sampled rows to R₁ comprising r₁=O(k) rows using BSS sampling uses the input-sparsity-time BSS sampling algorithm; adaptively sampling r₂=O(k/ε) rows of A occurs in input-sparsity time; and constructing the matrix U based on the matrices A, C, and R is performed in input-sparsity time.
 5. The method of claim 4, wherein the construction of the matrices C, U, and R occurs in input-sparsity time.
 6. The method of claim 2, wherein: adaptively sampling c₂=O(k/ε) columns of A comprises derandomization; and adaptively sampling r₂=O(k/ε) rows of A comprises derandomization.
 7. The method of claim 6, wherein the construction of the matrices C, U, and R comprises a deterministic CUR decomposition.
 8. A system comprising: a memory; and one or more processors, communicatively coupled to the memory, the one or more processors configured to: construct, by a computer processor, a matrix C based on a matrix A; construct a matrix R based on the matrix A and the matrix C; and construct a matrix U based on the matrices A, C, and R; wherein the matrices C, U, and R comprise a CUR decomposition of the matrix A; and wherein the construction of the matrices C, U, and R comprises at least one of an input-sparsity-time CUR and a deterministic CUR.
 9. The system of claim 8, wherein: to construct the matrix C based on a matrix A, the one or more processors are further configured to: compute Z₁ as the top k singular vectors of A, wherein a scalar k is less than the rank of the matrix A, wherein a scalar ε is between 0 and 1, and wherein ∥A−AZ₁Z₁ ^(T)∥_(F) ²≦(1+ε)∥A−A_(k)∥_(F) ²; sample O(k log k) columns from the transpose of Z₁, wherein the sampling uses leverage scores; down-sample the sampled columns to C₁ comprising c₁=O(k) columns using BSS sampling, wherein ∥A−C₁C₁ ^(†)A∥_(F) ²≦O(1)∥A−AZ₁Z₁ ^(T)∥_(F) ²; and adaptively sample c₂=O(k/ε) columns of A, wherein $\|A - C \cdot D\|_{F}^{2} \leq \|A - A_{k}\|_{F}^{2} + \left( \frac{k}{c_{2}} \right)\|A - C_{1}C_{1}^{\dagger}A\|_{F}^{2};$ and to construct the matrix R based on the matrix A, the one or more processors are further configured to: find Z₂ in the span of the matrix C, wherein ∥A−AZ₂Z₂ ^(T)∥_(F) ²≦(1+ε)∥A−A_(k)∥_(F) ²; sample O(k log k) rows of Z₂, wherein the sampling uses leverage scores; down-sample the sampled rows to R₁ comprising r₁=O(k) rows using BSS sampling, wherein ∥A−AR₁ ^(†)R₁∥_(F) ²≦O(1)∥A−AZ₂Z₂ ^(T)∥_(F) ²; and adaptively sample r₂=O(k/ε) rows of A, wherein $\|A - Z_{2}Z_{2}^{T}AR^{\dagger}R\|_{F}^{2} \leq \|A - Z_{2}Z_{2}^{T}A\|_{F}^{2} + \frac{rank\left( Z_{1}Z_{2}^{T}A \right)}{r_{2}}\|A - AR_{1}^{\dagger}R_{1}\|_{F}^{2}.$
 10. The system of claim 9, wherein the one or more processors are further configured to project one or more columns of the matrix A*W onto the matrix C, and wherein the matrix W is a random subspace embedding.
 11. The system of claim 9, wherein: down-sampling the sampled columns to C₁ comprising c₁=O(k) columns using BSS sampling uses an input-sparsity-time BSS sampling algorithm; adaptively sampling c₂=O(k/ε) columns of A occurs in input-sparsity time; down-sampling the sampled rows to R₁ comprising r₁=O(k) rows using BSS sampling uses the input-sparsity-time BSS sampling algorithm; adaptively sampling r₂=O(k/ε) rows of A occurs in input-sparsity time; and constructing the matrix U based on the matrices A, C, and R is performed in input-sparsity time.
 12. The system of claim 11, wherein the construction of the matrices C, U, and R occurs in input-sparsity time.
 13. The system of claim 9, wherein: adaptively sampling c₂=O(k/ε) columns of A comprises derandomization; and adaptively sampling r₂=O(k/ε) rows of A comprises derandomization.
 14. The system of claim 13, wherein the construction of the matrices C, U, and R comprises a deterministic CUR decomposition.
 15. A computer program product for finding a CUR decomposition, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising: constructing a matrix C based on a matrix A; constructing a matrix R based on the matrix A and the matrix C; and constructing a matrix U based on the matrices A, C, and R; wherein the matrices C, U, and R comprise a CUR decomposition of the matrix A; and wherein the construction of the matrices C, U, and R comprises at least one of an input-sparsity-time CUR and a deterministic CUR.
 16. The computer program product of claim 15, wherein: constructing the matrix C based on a matrix A comprises: computing Z₁ as the top k singular vectors of A, wherein a scalar k is less than the rank of the matrix A, wherein a scalar ε is between 0 and 1, and wherein ∥A−AZ₁Z₁ ^(T)∥_(F) ²≦(1+ε)∥A−A_(k)∥_(F) ²; sampling O(k log k) columns from the transpose of Z₁, wherein the sampling uses leverage scores; down-sampling the sampled columns to C₁ comprising c₁=O(k) columns using BSS sampling, wherein ∥A−C₁C₁ ^(†)A∥_(F) ²≦O(1)∥A−AZ₁Z₁ ^(T)∥_(F) ²; and adaptively sampling c₂=O(k/ε) columns of A, wherein $\|A - C \cdot D\|_{F}^{2} \leq \|A - A_{k}\|_{F}^{2} + \left( \frac{k}{c_{2}} \right)\|A - C_{1}C_{1}^{\dagger}A\|_{F}^{2};$ and constructing the matrix R based on the matrix A comprises: finding Z₂ in the span of the matrix C, wherein ∥A−AZ₂Z₂ ^(T)∥_(F) ²≦(1+ε)∥A−A_(k)∥_(F) ²; sampling O(k log k) rows of Z₂, wherein the sampling uses leverage scores; down-sampling the sampled rows to R₁ comprising r₁=O(k) rows using BSS sampling, wherein ∥A−AR₁ ^(†)R₁∥_(F) ²≦O(1)∥A−AZ₂Z₂ ^(T)∥_(F) ²; and adaptively sampling r₂=O(k/ε) rows of A, wherein $\|A - Z_{2}Z_{2}^{T}AR^{\dagger}R\|_{F}^{2} \leq \|A - Z_{2}Z_{2}^{T}A\|_{F}^{2} + \frac{rank\left( Z_{1}Z_{2}^{T}A \right)}{r_{2}}\|A - AR_{1}^{\dagger}R_{1}\|_{F}^{2}.$
 17. The computer program product of claim 16, wherein: down-sampling the sampled columns to C₁ comprising c₁=O(k) columns using BSS sampling uses an input-sparsity-time BSS sampling algorithm; adaptively sampling c₂=O(k/ε) columns of A occurs in input-sparsity time; down-sampling the sampled rows to R₁ comprising r₁=O(k) rows using BSS sampling uses the input-sparsity-time BSS sampling algorithm; adaptively sampling r₂=O(k/ε) rows of A occurs in input-sparsity time; and constructing the matrix U based on the matrices A, C, and R is performed in input-sparsity time.
 18. The computer program product of claim 17, wherein the construction of the matrices C, U, and R occurs in input-sparsity time.
 19. The computer program product of claim 16, wherein: adaptively sampling c₂=O(k/ε) columns of A comprises derandomization; and adaptively sampling r₂=O(k/ε) rows of A comprises derandomization.
 20. The computer program product of claim 19, wherein the construction of the matrices C, U, and R comprises a deterministic CUR decomposition.